[0/2] Fix Armada 38x mvneta lockups when switching speeds

Message ID 20200630160452.GD1551@shell.armlinux.org.uk

Message

Russell King (Oracle) June 30, 2020, 4:04 p.m. UTC
Hi,

While testing phylink over the weekend, I found it was possible to
cause the mvneta hardware to lock up in various weird and wonderful
ways by switching the interface speed between 1G and 2.5G repeatedly.
It didn't require rapid switching; one switch every few seconds was
enough.

Symptoms included one or more of:
- Timeout while trying to stop transmit (seen once)
- 2500BASE-X link negotiation failure (fails to exchange link word.)
- Detects lack of sync, but fails to flag 10ms of sync failure.
- SyncOk bit randomly toggles.

Once the hardware gets into a "bad" state, trying to recover it by
using the mvneta GMAC port reset fails to resolve the issue.
Disabling the port also fails to recover it.  The only way to
recover seemed to be via a reboot.

Many solutions to solve this were tried in various combinations -
while changing the COMPHY configuration:
- putting the GMAC into reset
- disabling the GMAC port
- augmenting the COMPHY configuration to try to "cleanly" disable
  the COMPHY via phy_power_down() and reconfigure it via
  phy_power_up(), including resetting parts of the COMPHY and
  re-running the RX initialisation.

None of that worked.  It was then discovered from the u-boot sources
that there is an undocumented register that has a lane-specific bit
set at the end of COMPHY initialisation, once the loosely documented
COMPHY setup has completed.

Experimentation with that showed that if the lane specific bit is
cleared before changing the COMPHY "GEN" configuration, and set
afterwards, mvneta no longer locks up.
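
To illustrate the ordering described above, here is a rough user-space
model. It is only a sketch: the hardware is modelled as a plain struct,
and every name in it (fake_comphy, fake_set_conf, fake_set_gen,
fake_set_mode, pll_locks) is made up for illustration and does not exist
in the driver. The lane bit is cleared, the "GEN" setting is changed, and
the bit is only set again if the PLLs report ready.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware; not the real register layout. */
struct fake_comphy {
	uint32_t conf;               /* models the undocumented per-lane bits */
	unsigned int gen;            /* models the COMPHY "GEN" (speed) setting */
	bool pll_locks;              /* test knob: does the PLL relock? */
	bool pll_ready;              /* models the PLL ready status */
	bool changed_while_enabled;  /* set if GEN changed with the bit set */
};

/* Read-modify-write of the lane-specific bit. */
static void fake_set_conf(struct fake_comphy *phy, unsigned int port,
			  bool enable)
{
	assert(port < 32);
	if (enable)
		phy->conf |= UINT32_C(1) << port;
	else
		phy->conf &= ~(UINT32_C(1) << port);
}

/* Change the speed; record whether the lane bit was still set. */
static void fake_set_gen(struct fake_comphy *phy, unsigned int port,
			 unsigned int gen)
{
	assert(port < 32);
	if (phy->conf & (UINT32_C(1) << port))
		phy->changed_while_enabled = true;
	phy->gen = gen;
	phy->pll_ready = phy->pll_locks;
}

/* Clear bit, reprogram, then set the bit only once the PLL is ready. */
static int fake_set_mode(struct fake_comphy *phy, unsigned int port,
			 unsigned int gen)
{
	fake_set_conf(phy, port, false);
	fake_set_gen(phy, port, gen);
	if (!phy->pll_ready)
		return -1;
	fake_set_conf(phy, port, true);
	return 0;
}
```

In this model, bits belonging to other lanes are left untouched, and a
failed PLL lock leaves the lane bit cleared.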

Unfortunately, this undocumented register is not part of the COMPHY
register set that we map - it is located in a region of "System
Registers" which are shared between multiple different devices.

Who should be responsible for mapping this register (mvneta or
COMPHY) was considered; the register is only present on Armada 38x
systems, and seemingly not on Armada 37x or Armada 37xx systems.
It seems that it is a system-level register.  The COMPHYs seem to
be system specific, so let's make it part of the COMPHY.

With no real information on this register, all we can do is guess
about its function and how to fit it into the system.

I've mentioned this to Thomas Petazzoni on #mvlinux, but that has
not yet led anywhere.

 .../bindings/phy/phy-armada38x-comphy.txt          | 10 ++++-
 arch/arm/boot/dts/armada-38x.dtsi                  |  3 +-
 drivers/phy/marvell/phy-armada38x-comphy.c         | 45 ++++++++++++++++++----
 3 files changed, 49 insertions(+), 9 deletions(-)

Comments

Russell King (Oracle) June 30, 2020, 4:06 p.m. UTC | #1
On Tue, Jun 30, 2020 at 05:05:38PM +0100, Russell King wrote:
> The mvneta hardware appears to lock up in various random ways when
> repeatedly switching speeds between 1G and 2.5G, which involves
> reprogramming the COMPHY.  It is not entirely clear why this happens,
> but best guess is that reprogramming the COMPHY glitches mvneta clocks
> causing the hardware to fail.  It seems that rebooting resolves the
> failure, but not down/up cycling the interface alone.
> 
> Various other approaches have been tried, such as trying to cleanly
> power down the COMPHY and then take it back through the power up
> initialisation, but this does not seem to help.
> 
> It was finally noticed that u-boot's last step when configuring a
> COMPHY for "SGMII" mode was to poke at a register described as
> "GBE_CONFIGURATION_REG", which is undocumented in any external
> documentation.  All that we have is the fact that u-boot sets a bit
> corresponding to the "SGMII" lane at the end of COMPHY initialisation.
> 
> Experimentation shows that if we clear this bit prior to changing the
> speed, and then set it afterwards, mvneta does not suffer this problem
> on the SolidRun Clearfog when switching speeds between 1G and 2.5G.
> 
> This problem was found while script-testing phylink.
> 
> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>

I forgot...

Fixes: 14dc100b4411 ("phy: armada38x: add common phy support")

> ---
>  arch/arm/boot/dts/armada-38x.dtsi          |  3 +-
>  drivers/phy/marvell/phy-armada38x-comphy.c | 45 ++++++++++++++++++----
>  2 files changed, 40 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm/boot/dts/armada-38x.dtsi b/arch/arm/boot/dts/armada-38x.dtsi
> index e038abc0c6b4..420ae26e846b 100644
> --- a/arch/arm/boot/dts/armada-38x.dtsi
> +++ b/arch/arm/boot/dts/armada-38x.dtsi
> @@ -344,7 +344,8 @@
>  
>  			comphy: phy@18300 {
>  				compatible = "marvell,armada-380-comphy";
> -				reg = <0x18300 0x100>;
> +				reg-names = "comphy", "conf";
> +				reg = <0x18300 0x100>, <0x18460 4>;
>  				#address-cells = <1>;
>  				#size-cells = <0>;
>  
> diff --git a/drivers/phy/marvell/phy-armada38x-comphy.c b/drivers/phy/marvell/phy-armada38x-comphy.c
> index 6960dfd8ad8c..0fe408964334 100644
> --- a/drivers/phy/marvell/phy-armada38x-comphy.c
> +++ b/drivers/phy/marvell/phy-armada38x-comphy.c
> @@ -41,6 +41,7 @@ struct a38x_comphy_lane {
>  
>  struct a38x_comphy {
>  	void __iomem *base;
> +	void __iomem *conf;
>  	struct device *dev;
>  	struct a38x_comphy_lane lane[MAX_A38X_COMPHY];
>  };
> @@ -54,6 +55,21 @@ static const u8 gbe_mux[MAX_A38X_COMPHY][MAX_A38X_PORTS] = {
>  	{ 0, 0, 3 },
>  };
>  
> +static void a38x_set_conf(struct a38x_comphy_lane *lane, bool enable)
> +{
> +	struct a38x_comphy *priv = lane->priv;
> +	u32 conf;
> +
> +	if (priv->conf) {
> +		conf = readl_relaxed(priv->conf);
> +		if (enable)
> +			conf |= BIT(lane->port);
> +		else
> +			conf &= ~BIT(lane->port);
> +		writel(conf, priv->conf);
> +	}
> +}
> +
>  static void a38x_comphy_set_reg(struct a38x_comphy_lane *lane,
>  				unsigned int offset, u32 mask, u32 value)
>  {
> @@ -97,6 +113,7 @@ static int a38x_comphy_set_mode(struct phy *phy, enum phy_mode mode, int sub)
>  {
>  	struct a38x_comphy_lane *lane = phy_get_drvdata(phy);
>  	unsigned int gen;
> +	int ret;
>  
>  	if (mode != PHY_MODE_ETHERNET)
>  		return -EINVAL;
> @@ -115,13 +132,20 @@ static int a38x_comphy_set_mode(struct phy *phy, enum phy_mode mode, int sub)
>  		return -EINVAL;
>  	}
>  
> +	a38x_set_conf(lane, false);
> +
>  	a38x_comphy_set_speed(lane, gen, gen);
>  
> -	return a38x_comphy_poll(lane, COMPHY_STAT1,
> -				COMPHY_STAT1_PLL_RDY_TX |
> -				COMPHY_STAT1_PLL_RDY_RX,
> -				COMPHY_STAT1_PLL_RDY_TX |
> -				COMPHY_STAT1_PLL_RDY_RX);
> +	ret = a38x_comphy_poll(lane, COMPHY_STAT1,
> +			       COMPHY_STAT1_PLL_RDY_TX |
> +			       COMPHY_STAT1_PLL_RDY_RX,
> +			       COMPHY_STAT1_PLL_RDY_TX |
> +			       COMPHY_STAT1_PLL_RDY_RX);
> +
> +	if (ret == 0)
> +		a38x_set_conf(lane, true);
> +
> +	return ret;
>  }
>  
>  static const struct phy_ops a38x_comphy_ops = {
> @@ -174,14 +198,21 @@ static int a38x_comphy_probe(struct platform_device *pdev)
>  	if (!priv)
>  		return -ENOMEM;
>  
> -	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> -	base = devm_ioremap_resource(&pdev->dev, res);
> +	base = devm_platform_ioremap_resource(pdev, 0);
>  	if (IS_ERR(base))
>  		return PTR_ERR(base);
>  
>  	priv->dev = &pdev->dev;
>  	priv->base = base;
>  
> +	/* Optional */
> +	res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "conf");
> +	if (res) {
> +		priv->conf = devm_ioremap_resource(&pdev->dev, res);
> +		if (IS_ERR(priv->conf))
> +			return PTR_ERR(priv->conf);
> +	}
> +
>  	for_each_available_child_of_node(pdev->dev.of_node, child) {
>  		struct phy *phy;
>  		int ret;
> -- 
> 2.20.1
> 
>
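
As a small worked check of why the DT hunk above adds a second "reg"
entry rather than reusing the existing one: the "conf" register at
0x18460 lies outside the 0x100-byte COMPHY window starting at 0x18300.
The sketch below just does that arithmetic; the struct and helper names
(reg_window, window_contains) are hypothetical and only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one "reg" entry: base address and size. */
struct reg_window {
	uint32_t base;
	uint32_t size;
};

/* True if the len-byte access at addr fits entirely inside the window. */
static bool window_contains(struct reg_window w, uint32_t addr, uint32_t len)
{
	assert(len > 0 && w.size > 0);
	return addr >= w.base && len <= w.size &&
	       addr - w.base <= w.size - len;
}
```

With the values from the DT hunk, a 4-byte access at 0x18460 is 0x160
bytes past the COMPHY base, so it does not fit in <0x18300 0x100> and
needs its own <0x18460 4> entry.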
Vinod Koul July 1, 2020, 6:57 a.m. UTC | #2
On 30-06-20, 17:05, Russell King wrote:
> The mvneta hardware appears to lock up in various random ways when
> repeatedly switching speeds between 1G and 2.5G, which involves
> reprogramming the COMPHY.  It is not entirely clear why this happens,
> but best guess is that reprogramming the COMPHY glitches mvneta clocks
> causing the hardware to fail.  It seems that rebooting resolves the
> failure, but not down/up cycling the interface alone.
> 
> Various other approaches have been tried, such as trying to cleanly
> power down the COMPHY and then take it back through the power up
> initialisation, but this does not seem to help.
> 
> It was finally noticed that u-boot's last step when configuring a
> COMPHY for "SGMII" mode was to poke at a register described as
> "GBE_CONFIGURATION_REG", which is undocumented in any external
> documentation.  All that we have is the fact that u-boot sets a bit
> corresponding to the "SGMII" lane at the end of COMPHY initialisation.
> 
> Experimentation shows that if we clear this bit prior to changing the
> speed, and then set it afterwards, mvneta does not suffer this problem
> on the SolidRun Clearfog when switching speeds between 1G and 2.5G.
> 
> This problem was found while script-testing phylink.
> 
> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
> ---
>  arch/arm/boot/dts/armada-38x.dtsi          |  3 +-

lgtm, i need ack for dts parts before I can apply this

Russell King (Oracle) July 10, 2020, 3:19 p.m. UTC | #3
On Wed, Jul 01, 2020 at 12:27:27PM +0530, Vinod Koul wrote:
> lgtm, i need ack for dts parts before I can apply this

I'm not sure what the situation is for Bootlin, but they don't seem to
be very responsive right now (covid related?)

What I know from what I've been party to on netdev is that Bootlin
sent a patch for the MVPP2 driver, and the very next day someone
reported that the patch caused a bug.  Unfortunately, the patch got
picked up anyway, but there was no response from Bootlin.  After a
month or so, -final was released containing this patch, so now it
had become a regression - and still no response from Bootlin.

Eventually the bug got fixed - not because Bootlin fixed it, but
because I ended up spending the time researching how that part of
the network driver worked, diagnosing what was going on, and
eventually fixing it in the most obvious way - but it's not clear
that the fix was the right approach.  Bootlin never commented.  See
3138a07ce219 ("net: mvpp2: fix RX hashing for non-10G ports").

So, I think we have to assume that Bootlin are struggling right now,
and as it's been over a week, it's unlikely that they are going to
respond soon.  What do you think we should do?

I also note that Rob has not responded to the DT binding change
either, despite me gently prodding, and Rob processing a whole raft
of DT binding stuff yesterday.

I can split the DTS change from the rest of the patch, but I don't
think that really helps without at least the binding change being
agreed.
Vinod Koul July 13, 2020, 6:18 a.m. UTC | #4
On 10-07-20, 16:19, Russell King - ARM Linux admin wrote:
> I'm not sure what the situation is for Bootlin, but they don't seem to
> be very responsive right now (covid related?)
> 
> What I know from what I've been party to on netdev is that Bootlin
> sent a patch for the MVPP2 driver, and the very next day someone
> reported that the patch caused a bug.  Unfortunately, the patch got
> picked up anyway, but there was no response from Bootlin.  After a
> month or so, -final was released containing this patch, so now it
> had become a regression - and still no response from Bootlin.
> 
> Eventually the bug got fixed - not because Bootlin fixed it, but
> because I ended up spending the time researching how that part of
> the network driver worked, diagnosing what was going on, and
> eventually fixing it in the most obvious way - but it's not clear
> that the fix was the right approach.  Bootlin never commented.  See
> 3138a07ce219 ("net: mvpp2: fix RX hashing for non-10G ports").
> 
> So, I think we have to assume that Bootlin are struggling right now,
> and as it's been over a week, it's unlikely that they are going to
> respond soon.  What do you think we should do?
> 
> I also note that Rob has not responded to the DT binding change
> either, despite me gently prodding, and Rob processing a whole raft
> of DT binding stuff yesterday.
> 
> I can split the DTS change from the rest of the patch, but I don't
> think that really helps without at least the binding change being
> agreed.

I would prefer splitting; you may send the DTS to arm arch folks if no
response from subarch folks
Gregory CLEMENT July 13, 2020, 3:36 p.m. UTC | #5
Hello,

> On 10-07-20, 16:19, Russell King - ARM Linux admin wrote:
>> I'm not sure what the situation is for Bootlin, but they don't seem to
>> be very responsive right now (covid related?)
>> 
>> What I know from what I've been party to on netdev is that Bootlin
>> sent a patch for the MVPP2 driver, and the very next day someone
>> reported that the patch caused a bug.  Unfortunately, the patch got
>> picked up anyway, but there was no response from Bootlin.  After a
>> month or so, -final was released containing this patch, so now it
>> had become a regression - and still no response from Bootlin.
>> 
>> Eventually the bug got fixed - not because Bootlin fixed it, but
>> because I ended up spending the time researching how that part of
>> the network driver worked, diagnosing what was going on, and
>> eventually fixing it in the most obvious way - but it's not clear
>> that the fix was the right approach.  Bootlin never commented.  See
>> 3138a07ce219 ("net: mvpp2: fix RX hashing for non-10G ports").
>> 
>> So, I think we have to assume that Bootlin are struggling right now,
>> and as it's been over a week, it's unlikely that they are going to
>> respond soon.  What do you think we should do?
>> 
>> I also note that Rob has not responded to the DT binding change
>> either, despite me gently prodding, and Rob processing a whole raft
>> of DT binding stuff yesterday.
>> 
>> I can split the DTS change from the rest of the patch, but I don't
>> think that really helps without at least the binding change being
>> agreed.
>
> I would prefer splitting; you may send the DTS to arm arch folks if no
> response from subarch folks

Yes, please could you split the patch to put the dts apart? And if the
binding is accepted we will apply it.

Thanks,

Gregory


>
> -- 
> ~Vinod
Russell King (Oracle) July 13, 2020, 5:21 p.m. UTC | #6
On Mon, Jul 13, 2020 at 05:36:54PM +0200, Gregory CLEMENT wrote:
> Hello,
> 
> > On 10-07-20, 16:19, Russell King - ARM Linux admin wrote:
> >> I'm not sure what the situation is for Bootlin, but they don't seem to
> >> be very responsive right now (covid related?)
> >> 
> >> What I know from what I've been party to on netdev is that Bootlin
> >> sent a patch for the MVPP2 driver, and the very next day someone
> >> reported that the patch caused a bug.  Unfortunately, the patch got
> >> picked up anyway, but there was no response from Bootlin.  After a
> >> month or so, -final was released containing this patch, so now it
> >> had become a regression - and still no response from Bootlin.
> >> 
> >> Eventually the bug got fixed - not because Bootlin fixed it, but
> >> because I ended up spending the time researching how that part of
> >> the network driver worked, diagnosing what was going on, and
> >> eventually fixing it in the most obvious way - but it's not clear
> >> that the fix was the right approach.  Bootlin never commented.  See
> >> 3138a07ce219 ("net: mvpp2: fix RX hashing for non-10G ports").
> >> 
> >> So, I think we have to assume that Bootlin are struggling right now,
> >> and as it's been over a week, it's unlikely that they are going to
> >> respond soon.  What do you think we should do?
> >> 
> >> I also note that Rob has not responded to the DT binding change
> >> either, despite me gently prodding, and Rob processing a whole raft
> >> of DT binding stuff yesterday.
> >> 
> >> I can split the DTS change from the rest of the patch, but I don't
> >> think that really helps without at least the binding change being
> >> agreed.
> >
> > I would prefer splitting; you may send the DTS to arm arch folks if no
> > response from subarch folks
> 
> Yes please could you split the patch to put the dts apart ? And if the
> binding is accepted we will apply it.

I don't see any sign that Rob will ever review the DTS part, so I'm
at the point of just not caring about this anymore. I will carry it
in my tree, but I'm going to do nothing further.

That means that switching speed on mvneta on the Armada 38x can
cause the network to die, but hey, if people can't be bothered to
review, and wish to impose rules such as "you can't change anything
with DT without my express say so" which have the effect of blocking
fixes, that's really not my problem.

So, shrug, I'm giving up with these patches.  Sorry.
Russell King (Oracle) July 13, 2020, 6:07 p.m. UTC | #7
On Mon, Jul 13, 2020 at 06:21:40PM +0100, Russell King - ARM Linux admin wrote:
> On Mon, Jul 13, 2020 at 05:36:54PM +0200, Gregory CLEMENT wrote:
> > Hello,
> > 
> > > On 10-07-20, 16:19, Russell King - ARM Linux admin wrote:
> > >> I'm not sure what the situation is for Bootlin, but they don't seem to
> > >> be very responsive right now (covid related?)
> > >> 
> > >> What I know from what I've been party to on netdev is that Bootlin
> > >> sent a patch for the MVPP2 driver, and the very next day someone
> > >> reported that the patch caused a bug.  Unfortunately, the patch got
> > >> picked up anyway, but there was no response from Bootlin.  After a
> > >> month or so, -final was released containing this patch, so now it
> > >> had become a regression - and still no response from Bootlin.
> > >> 
> > >> Eventually the bug got fixed - not because Bootlin fixed it, but
> > >> because I ended up spending the time researching how that part of
> > >> the network driver worked, diagnosing what was going on, and
> > >> eventually fixing it in the most obvious way - but it's not clear
> > >> that the fix was the right approach.  Bootlin never commented.  See
> > >> 3138a07ce219 ("net: mvpp2: fix RX hashing for non-10G ports").
> > >> 
> > >> So, I think we have to assume that Bootlin are struggling right now,
> > >> and as it's been over a week, it's unlikely that they are going to
> > >> respond soon.  What do you think we should do?
> > >> 
> > >> I also note that Rob has not responded to the DT binding change
> > >> either, despite me gently prodding, and Rob processing a whole raft
> > >> of DT binding stuff yesterday.
> > >> 
> > >> I can split the DTS change from the rest of the patch, but I don't
> > >> think that really helps without at least the binding change being
> > >> agreed.
> > >
> > > I would prefer splitting; you may send the DTS to arm arch folks if no
> > > response from subarch folks
> > 
> > Yes please could you split the patch to put the dts apart ? And if the
> > binding is accepted we will apply it.
> 
> I don't see any sign that Rob will ever review the DTS part, so I'm
> at the point of just not caring about this anymore. I will carry it
> in my tree, but I'm going to do nothing further.
> 
> That means that switching speed on mvneta on the Armada 38x can
> cause the network to die, but hey, if people can't be bothered to
> review, and wish to impose rules such as "you can't change anything
> with DT without my express say so" which have the effect of blocking
> fixes, that's really not my problem.
> 
> So, shrug, I'm giving up with these patches.  Sorry.

To be clear, this is not aimed at either Vinod or Gregory.
Vinod Koul July 16, 2020, 5:46 a.m. UTC | #8
On 13-07-20, 19:07, Russell King - ARM Linux admin wrote:
> On Mon, Jul 13, 2020 at 06:21:40PM +0100, Russell King - ARM Linux admin wrote:
> > On Mon, Jul 13, 2020 at 05:36:54PM +0200, Gregory CLEMENT wrote:
> > > Hello,
> > > 
> > > > On 10-07-20, 16:19, Russell King - ARM Linux admin wrote:
> > > >> I'm not sure what the situation is for Bootlin, but they don't seem to
> > > >> be very responsive right now (covid related?)
> > > >> 
> > > >> What I know from what I've been party to on netdev is that Bootlin
> > > >> sent a patch for the MVPP2 driver, and the very next day someone
> > > >> reported that the patch caused a bug.  Unfortunately, the patch got
> > > >> picked up anyway, but there was no response from Bootlin.  After a
> > > >> month or so, -final was released containing this patch, so now it
> > > >> had become a regression - and still no response from Bootlin.
> > > >> 
> > > >> Eventually the bug got fixed - not because Bootlin fixed it, but
> > > >> because I ended up spending the time researching how that part of
> > > >> the network driver worked, diagnosing what was going on, and
> > > >> eventually fixing it in the most obvious way - but it's not clear
> > > >> that the fix was the right approach.  Bootlin never commented.  See
> > > >> 3138a07ce219 ("net: mvpp2: fix RX hashing for non-10G ports").
> > > >> 
> > > >> So, I think we have to assume that Bootlin are struggling right now,
> > > >> and as it's been over a week, it's unlikely that they are going to
> > > >> respond soon.  What do you think we should do?
> > > >> 
> > > >> I also note that Rob has not responded to the DT binding change
> > > >> either, despite me gently prodding, and Rob processing a whole raft
> > > >> of DT binding stuff yesterday.
> > > >> 
> > > >> I can split the DTS change from the rest of the patch, but I don't
> > > >> think that really helps without at least the binding change being
> > > >> agreed.
> > > >
> > > > I would prefer splitting; you may send the DTS to arm arch folks if no
> > > > response from subarch folks
> > > 
> > > Yes please could you split the patch to put the dts apart ? And if the
> > > binding is accepted we will apply it.
> > 
> > I don't see any sign that Rob will ever review the DTS part, so I'm
> > at the point of just not caring about this anymore. I will carry it
> > in my tree, but I'm going to do nothing further.
> > 
> > That means that switching speed on mvneta on the Armada 38x can
> > cause the network to die, but hey, if people can't be bothered to
> > review, and wish to impose rules such as "you can't change anything
> > with DT without my express say so" which have the effect of blocking
> > fixes, that's really not my problem.
> > 
> > So, shrug, I'm giving up with these patches.  Sorry.
> 
> To be clear, this is not aimed at either Vinod or Gregory.

Rob has acked, so if you can respin and split, I can apply