diff mbox series

[net] dsa: mv88e6xxx: Ensure all pending interrupts are handled prior to exit

Message ID 6a1ebc61-3505-beb8-21cb-ea42ad9fe67e@bell.net
State Accepted
Delegated to: David Miller
Headers show
Series [net] dsa: mv88e6xxx: Ensure all pending interrupts are handled prior to exit | expand

Commit Message

John David Anglin Feb. 11, 2019, 6:40 p.m. UTC
The GPIO interrupt controller on the espressobin board only supports edge interrupts.
If one enables the use of hardware interrupts in the device tree for the 88E6341, it is
possible to miss an edge.  When this happens, the INTn pin on the Marvell switch is
stuck low and no further interrupts occur.

I found after adding debug statements to mv88e6xxx_g1_irq_thread_work() that there is
a race in handling device interrupts (e.g. PHY link interrupts).  Some interrupts are
directly cleared by reading the Global 1 status register.  However, the device interrupt
flag, for example, is not cleared until all the unmasked SERDES and PHY ports are serviced.
This is done by reading the relevant SERDES and PHY status register.

The code only services interrupts whose status bit is set at the time of reading its status
register.  If an interrupt event occurs after its status is read and before all interrupts
are serviced, then this event will not be serviced and the INTn output pin will remain low.

This is not a problem with polling or level interrupts since the handler will be called
again to process the event.  However, it's a big problem when using level interrupts.

The fix presented here is to add a loop around the code servicing switch interrupts.  If
any pending interrupts remain after the current set has been handled, we loop and process
the new set.  If there are no pending interrupts after servicing, we are sure that INTn has
gone high and we will get an edge when a new event occurs.

Tested on espressobin board.

Signed-off-by:  John David Anglin <dave.anglin@bell.net>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

Comments

Andrew Lunn Feb. 11, 2019, 11:33 p.m. UTC | #1
> Signed-off-by:  John David Anglin <dave.anglin@bell.net>
> ---
>  drivers/net/dsa/mv88e6xxx/chip.c | 28 ++++++++++++++++++++++------
>  1 file changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
> index 8dca2c949e73..12fd7ce3f1ff 100644
> --- a/drivers/net/dsa/mv88e6xxx/chip.c
> +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> @@ -261,6 +261,7 @@ static irqreturn_t mv88e6xxx_g1_irq_thread_work(struct mv88e6xxx_chip *chip)
>  	unsigned int sub_irq;
>  	unsigned int n;
>  	u16 reg;
> +	u16 ctl1;
>  	int err;
> 
>  	mutex_lock(&chip->reg_lock);
> @@ -270,13 +271,28 @@ static irqreturn_t mv88e6xxx_g1_irq_thread_work(struct mv88e6xxx_chip *chip)
>  	if (err)
>  		goto out;
> 
> -	for (n = 0; n < chip->g1_irq.nirqs; ++n) {
> -		if (reg & (1 << n)) {
> -			sub_irq = irq_find_mapping(chip->g1_irq.domain, n);
> -			handle_nested_irq(sub_irq);
> -			++nhandled;
> +	do {
> +		for (n = 0; n < chip->g1_irq.nirqs; ++n) {
> +			if (reg & (1 << n)) {
> +				sub_irq = irq_find_mapping(chip->g1_irq.domain,
> +							   n);
> +				handle_nested_irq(sub_irq);
> +				++nhandled;
> +			}
>  		}
> -	}
> +
> +		mutex_lock(&chip->reg_lock);
> +		err = mv88e6xxx_g1_read(chip, MV88E6XXX_G1_CTL1, &ctl1);
> +		if (err)
> +			goto unlock;
> +		err = mv88e6xxx_g1_read(chip, MV88E6XXX_G1_STS, &reg);
> +unlock:
> +		mutex_unlock(&chip->reg_lock);
> +		if (err)
> +			goto out;
> +		ctl1 &= GENMASK(chip->g1_irq.nirqs, 0);
> +	} while (reg & ctl1);

Hi David

I just tested this on one of my boards. It loops endlessly:

[   47.173396] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
[   47.182108] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
[   47.190820] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
[   47.199535] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
[   47.208254] mv88e6xxx_g1_irq_thread_work: c881 a8 80   

These are reg, ctl1, reg & ctl1.

So there is an unhandled device interrupt. I think this is because
device interrupts are not masked before installing the interrupt
handler. But i've not fully got to the bottom of this yet.

	 Andrew
John David Anglin Feb. 12, 2019, 12:57 a.m. UTC | #2
On 2019-02-11 6:33 p.m., Andrew Lunn wrote:
>> Signed-off-by:  John David Anglin <dave.anglin@bell.net>
>> ---
>>  drivers/net/dsa/mv88e6xxx/chip.c | 28 ++++++++++++++++++++++------
>>  1 file changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
>> index 8dca2c949e73..12fd7ce3f1ff 100644
>> --- a/drivers/net/dsa/mv88e6xxx/chip.c
>> +++ b/drivers/net/dsa/mv88e6xxx/chip.c
>> @@ -261,6 +261,7 @@ static irqreturn_t mv88e6xxx_g1_irq_thread_work(struct mv88e6xxx_chip *chip)
>>  	unsigned int sub_irq;
>>  	unsigned int n;
>>  	u16 reg;
>> +	u16 ctl1;
>>  	int err;
>>
>>  	mutex_lock(&chip->reg_lock);
>> @@ -270,13 +271,28 @@ static irqreturn_t mv88e6xxx_g1_irq_thread_work(struct mv88e6xxx_chip *chip)
>>  	if (err)
>>  		goto out;
>>
>> -	for (n = 0; n < chip->g1_irq.nirqs; ++n) {
>> -		if (reg & (1 << n)) {
>> -			sub_irq = irq_find_mapping(chip->g1_irq.domain, n);
>> -			handle_nested_irq(sub_irq);
>> -			++nhandled;
>> +	do {
>> +		for (n = 0; n < chip->g1_irq.nirqs; ++n) {
>> +			if (reg & (1 << n)) {
>> +				sub_irq = irq_find_mapping(chip->g1_irq.domain,
>> +							   n);
>> +				handle_nested_irq(sub_irq);
>> +				++nhandled;
>> +			}
>>  		}
>> -	}
>> +
>> +		mutex_lock(&chip->reg_lock);
>> +		err = mv88e6xxx_g1_read(chip, MV88E6XXX_G1_CTL1, &ctl1);
>> +		if (err)
>> +			goto unlock;
>> +		err = mv88e6xxx_g1_read(chip, MV88E6XXX_G1_STS, &reg);
>> +unlock:
>> +		mutex_unlock(&chip->reg_lock);
>> +		if (err)
>> +			goto out;
>> +		ctl1 &= GENMASK(chip->g1_irq.nirqs, 0);
>> +	} while (reg & ctl1);
> Hi David
>
> I just tested this on one of my boards. It loops endlessly:
>
> [   47.173396] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> [   47.182108] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> [   47.190820] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> [   47.199535] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> [   47.208254] mv88e6xxx_g1_irq_thread_work: c881 a8 80   
>
> These are reg, ctl1, reg & ctl1.
>
> So there is an unhandled device interrupt. I think this is because
> device interrupts are not masked before installing the interrupt
> handler. But i've not fully got to the bottom of this yet.
Yes, it is true the PHY and SERDES enables in Global 2 should be cleared before the interrupt handler
is installed for device interrupts.  That's what is done for the interrupts enables in Global 1.  I'm
not seeing that these enables are initialized.

Which switch?  The device interrupts are not be cleared properly on that board.  Would it be possible
to also print the Global 2 status and enables?  Unplugging the cable that's causing the loop might
cause the loop to stop.

I suspect the same would happen if level interrupts were used.

I tested both edge and polling on espressobin with Armada 3700.  There's no problem with
looping there.  I've booted it many times.  I've unplugged and plugged cables many times.

Dave
Andrew Lunn Feb. 12, 2019, 1:21 a.m. UTC | #3
> Yes, it is true the PHY and SERDES enables in Global 2 should be
> cleared before the interrupt handler is installed for device
> interrupts.  That's what is done for the interrupts enables in
> Global 1.  I'm not seeing that these enables are initialized.
> 
> Which switch?

6390X.

> The device interrupts are not be cleared properly on that board.

I added in code to mask all interrupts. It did not help. I need to go
deeper and see if it is a PHY problem.

> I suspect the same would happen if level interrupts were used.

I've not seen it loop. Which is why i want to understand it fully.

     Andrew
Andrew Lunn Feb. 12, 2019, 3:58 a.m. UTC | #4
> > Hi David
> >
> > I just tested this on one of my boards. It loops endlessly:
> >
> > [   47.173396] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> > [   47.182108] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> > [   47.190820] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> > [   47.199535] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> > [   47.208254] mv88e6xxx_g1_irq_thread_work: c881 a8 80   
> >
> > These are reg, ctl1, reg & ctl1.
> >
> > So there is an unhandled device interrupt.

Hi Heiner

Your patch Fixes: 2b3e88ea6528 ("net: phy: improve phy state
checking") is causing me problems with interrupts for the Marvell
switches.

That change means we don't check the PHY device if it caused an
interrupt when its state is less than UP.

What i'm seeing is that the PHY is interrupting pretty early on after
a reboot when the previous boot had the interface up.

[   10.125702] Marvell 88E6390 mv88e6xxx-0:02: phy_start_interrupts
[   10.162798] Marvell 88E6390 mv88e6xxx-0:02: phy_enable_interrupts
[   10.168931] Marvell 88E6390 mv88e6xxx-0:02: marvell_ack_interrupt
[   10.180164] Marvell 88E6390 mv88e6xxx-0:02: marvell_config_intr 1

a little later it interrupts:

[   12.999717] mv88e6xxx_g1_irq_thread_fn
[   13.007253] mv88e6xxx_g2_irq_thread_fn: 4 811c 4
[   13.012015] libphy: __phy_is_started: phydev->state 1 PHY_UP 3
[   13.017941] Marvell 88E6390 mv88e6xxx-0:02: phy_interrupt: phy_is_started(phydev) 0

The current code just causes it to be ignored. So the interrupts fires
again, and again...

If i change to code to call into the PHY driver and let it handle the
interrupts, things keep running. A little bit later the interface is
configured up:

[   15.921326] mv88e6085 gpio-0:00 red: configuring for phy/gmii link mode
[   15.928693] libphy: __phy_is_started: phydev->state 3 PHY_UP 3
[   15.929442] IPv6: ADDRCONF(NETDEV_UP): red: link is not ready
[   15.935596] Marvell 88E6390 mv88e6xxx-0:02: m88e6390_config_aneg
[   15.935608] Marvell 88E6390 mv88e6xxx-0:02: m88e6390_errata

[   16.071364] Marvell 88E6390 mv88e6xxx-0:02: m88e1510_config_aneg
[   16.112362] Marvell 88E6390 mv88e6xxx-0:02: m88e1318_config_aneg
[   16.151245] Marvell 88E6390 mv88e6xxx-0:02: m88e1121_config_aneg
[   16.368206] Marvell 88E6390 mv88e6xxx-0:02: PHY state change UP -> NOLINK

and after another interrupt the link goes up.

[   19.519840] mv88e6xxx_g1_irq_thread_fn
[   19.528546] mv88e6xxx_g2_irq_thread_fn: 4 811c 4
[   19.534152] libphy: __phy_is_started: phydev->state 5 PHY_UP 3
[   19.540030] Marvell 88E6390 mv88e6xxx-0:02: phy_interrupt: phy_is_started(phydev) 1
[   19.547721] Marvell 88E6390 mv88e6xxx-0:02: m88e1121_did_interrupt
[   19.559829] Marvell 88E6390 mv88e6xxx-0:02: marvell_ack_interrupt
[   19.590753] Marvell 88E6390 mv88e6xxx-0:02: marvell_read_status
[   19.596712] Marvell 88E6390 mv88e6xxx-0:02: marvell_update_link
[   19.628387] Marvell 88E6390 mv88e6xxx-0:02: PHY state change NOLINK -> RUNNING
[   19.628453] mv88e6085 gpio-0:00 red: Link is Up - 1Gbps/Full - flow control off
[   19.635920] IPv6: ADDRCONF(NETDEV_CHANGE): red: link becomes ready

I don't yet know why the first interrupt happens, before we configure
auto-neg, etc. But it is not too unreasonable. We have configured
interrupts, so it could be reporting link down etc.

So i think we might need to revert part of this change, call into the
driver so long as the PHY is not in state PHY_HALTED.

What do you think?

     Andrew
Heiner Kallweit Feb. 12, 2019, 6:51 a.m. UTC | #5
On 12.02.2019 04:58, Andrew Lunn wrote:
>>> Hi David
>>>
>>> I just tested this on one of my boards. It loops endlessly:
>>>
>>> [   47.173396] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
>>> [   47.182108] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
>>> [   47.190820] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
>>> [   47.199535] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
>>> [   47.208254] mv88e6xxx_g1_irq_thread_work: c881 a8 80   
>>>
>>> These are reg, ctl1, reg & ctl1.
>>>
>>> So there is an unhandled device interrupt.
> 
> Hi Heiner
> 
> Your patch Fixes: 2b3e88ea6528 ("net: phy: improve phy state
> checking") is causing me problems with interrupts for the Marvell
> switches.
> 
Hi Andrew,

what kernel version is it?
And the PHY driver in use is "Marvell 88E6390" ?

> That change means we don't check the PHY device if it caused an
> interrupt when its state is less than UP.
> 
> What i'm seeing is that the PHY is interrupting pretty early on after
> a reboot when the previous boot had the interface up.
> 
So this means that when going down for reboot the interrupts are not
properly masked / disabled? Because (at least for net-next) we enable
interrupts in phy_start() only.


> [   10.125702] Marvell 88E6390 mv88e6xxx-0:02: phy_start_interrupts
> [   10.162798] Marvell 88E6390 mv88e6xxx-0:02: phy_enable_interrupts
> [   10.168931] Marvell 88E6390 mv88e6xxx-0:02: marvell_ack_interrupt
> [   10.180164] Marvell 88E6390 mv88e6xxx-0:02: marvell_config_intr 1
> 
> a little later it interrupts:
> 
> [   12.999717] mv88e6xxx_g1_irq_thread_fn
> [   13.007253] mv88e6xxx_g2_irq_thread_fn: 4 811c 4
> [   13.012015] libphy: __phy_is_started: phydev->state 1 PHY_UP 3
> [   13.017941] Marvell 88E6390 mv88e6xxx-0:02: phy_interrupt: phy_is_started(phydev) 0
> 
> The current code just causes it to be ignored. So the interrupts fires
> again, and again...
> 
I would have more expected the opposite. If the interrupt is ignored
(IRQ_NONE returned), then it doesn't get acked. And if it's not acked
new interrupts should be blocked. Or is it different with this chip?

> If i change to code to call into the PHY driver and let it handle the
> interrupts, things keep running. A little bit later the interface is
> configured up:
> 
> [   15.921326] mv88e6085 gpio-0:00 red: configuring for phy/gmii link mode
> [   15.928693] libphy: __phy_is_started: phydev->state 3 PHY_UP 3
> [   15.929442] IPv6: ADDRCONF(NETDEV_UP): red: link is not ready
> [   15.935596] Marvell 88E6390 mv88e6xxx-0:02: m88e6390_config_aneg
> [   15.935608] Marvell 88E6390 mv88e6xxx-0:02: m88e6390_errata
> 
> [   16.071364] Marvell 88E6390 mv88e6xxx-0:02: m88e1510_config_aneg
> [   16.112362] Marvell 88E6390 mv88e6xxx-0:02: m88e1318_config_aneg
> [   16.151245] Marvell 88E6390 mv88e6xxx-0:02: m88e1121_config_aneg
> [   16.368206] Marvell 88E6390 mv88e6xxx-0:02: PHY state change UP -> NOLINK
> 
> and after another interrupt the link goes up.
> 
> [   19.519840] mv88e6xxx_g1_irq_thread_fn
> [   19.528546] mv88e6xxx_g2_irq_thread_fn: 4 811c 4
> [   19.534152] libphy: __phy_is_started: phydev->state 5 PHY_UP 3
> [   19.540030] Marvell 88E6390 mv88e6xxx-0:02: phy_interrupt: phy_is_started(phydev) 1
> [   19.547721] Marvell 88E6390 mv88e6xxx-0:02: m88e1121_did_interrupt
> [   19.559829] Marvell 88E6390 mv88e6xxx-0:02: marvell_ack_interrupt
> [   19.590753] Marvell 88E6390 mv88e6xxx-0:02: marvell_read_status
> [   19.596712] Marvell 88E6390 mv88e6xxx-0:02: marvell_update_link
> [   19.628387] Marvell 88E6390 mv88e6xxx-0:02: PHY state change NOLINK -> RUNNING
> [   19.628453] mv88e6085 gpio-0:00 red: Link is Up - 1Gbps/Full - flow control off
> [   19.635920] IPv6: ADDRCONF(NETDEV_CHANGE): red: link becomes ready
> 
> I don't yet know why the first interrupt happens, before we configure
> auto-neg, etc. But it is not too unreasonable. We have configured
> interrupts, so it could be reporting link down etc.
> 
> So i think we might need to revert part of this change, call into the
> driver so long as the PHY is not in state PHY_HALTED.
> 
> What do you think?
> 
I will take a closer look later.

>      Andrew
> 
Heiner
Andrew Lunn Feb. 12, 2019, 12:56 p.m. UTC | #6
On Tue, Feb 12, 2019 at 07:51:05AM +0100, Heiner Kallweit wrote:
> On 12.02.2019 04:58, Andrew Lunn wrote:
> >>> Hi David
> >>>
> >>> I just tested this on one of my boards. It loops endlessly:
> >>>
> >>> [   47.173396] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> >>> [   47.182108] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> >>> [   47.190820] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> >>> [   47.199535] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
> >>> [   47.208254] mv88e6xxx_g1_irq_thread_work: c881 a8 80   
> >>>
> >>> These are reg, ctl1, reg & ctl1.
> >>>
> >>> So there is an unhandled device interrupt.
> > 
> > Hi Heiner
> > 
> > Your patch Fixes: 2b3e88ea6528 ("net: phy: improve phy state
> > checking") is causing me problems with interrupts for the Marvell
> > switches.
> > 
> Hi Andrew,
> 
> what kernel version is it?

It is a little bit old, 5.0-rc1 net-next. I should rebase and
retest. I'm testing on a ZII board which is not fully in mainline So i
need some patches.

> And the PHY driver in use is "Marvell 88E6390" ?

Yes, the marvell 1G driver.

     Andrew
Russell King (Oracle) Feb. 12, 2019, 4:30 p.m. UTC | #7
On Tue, Feb 12, 2019 at 07:51:05AM +0100, Heiner Kallweit wrote:
> On 12.02.2019 04:58, Andrew Lunn wrote:
> > That change means we don't check the PHY device if it caused an
> > interrupt when its state is less than UP.
> > 
> > What i'm seeing is that the PHY is interrupting pretty early on after
> > a reboot when the previous boot had the interface up.
> > 
> So this means that when going down for reboot the interrupts are not
> properly masked / disabled? Because (at least for net-next) we enable
> interrupts in phy_start() only.

Looking at Linus' tree as opposed to net-next, things do look rather
broken wrt interrupts:

+-phy_attach_direct
  `-phydev->state = PHY_READY
+-phy_prepare_link
+-phy_start_machine
  `-phy_trigger_machine()
`-phy_start_interrupts
  +-request_threaded_irq()
  `-phy_enable_interrupts()
    +-phy_clear_interrupt()
    `-phy_config_interrupt(, PHY_INTERRUPT_ENABLED)

At this point, the PHY is then able to generate interrupts, which,
because phy_start() has not been called and phy_interrupt() checks
that phydev->state >= PHY_UP, get ignored by the interrupt handler
exactly as Andrew is finding.

So it looks like 5.0-rc is already in need of this being fixed.

In looking at this, I came across this chunk of code:

static inline bool __phy_is_started(struct phy_device *phydev)
{
        WARN_ON(!mutex_is_locked(&phydev->lock));

        return phydev->state >= PHY_UP;
}

/**
 * phy_is_started - Convenience function to check whether PHY is started
 * @phydev: The phy_device struct
 */
static inline bool phy_is_started(struct phy_device *phydev)
{
        bool started;

        mutex_lock(&phydev->lock);
        started = __phy_is_started(phydev);
        mutex_unlock(&phydev->lock);

        return started;
}

which looks to me like over-complication.  The mutex locking there is
completely pointless - what are you trying to achieve with it?

Let's go through this.  The above is exactly equivalent to:

bool phy_is_started(phydev)
{
	int state;

	mutex_lock(&phydev->lock);
	state = phydev->state;
	mutex_unlock(&phydev->lock);

	return state >= PHY_UP;
}

since when we do the test is irrelevant.  Architectures that Linux
runs on are single-copy atomic, which means that reading phydev->state
itself is an atomic operation.  So, the mutex locking around that
doesn't add to the atomicity of the entire operation.

How, depending on what you do with the rest of this function depends
whether the entire operation is safe or not.  For example, let's take
this code at the end of phy_state_machine():

        if (phy_polling_mode(phydev) && phy_is_started(phydev))
                phy_queue_state_machine(phydev, PHY_STATE_TIME);

state = PHY_UP
		thread 0			thread 1
						phy_disconnect()
						+-phy_is_started()
		phy_is_started()                |
						`-phy_stop()
						  +-phydev->state = PHY_HALTED
						  `-phy_stop_machine()
						    `-cancel_delayed_work_sync()
		phy_queue_state_machine()
		`-mod_delayed_work()

At this point, the phydev->state_queue() has been added back onto the
system workqueue despite phy_stop_machine() having been called and
cancel_delayed_work_sync() called on it.

The original code in 4.20 did not have this race condition.

Basically, the lock inside phy_is_started() does nothing useful, and
I'd say is dangerously misleading.
Heiner Kallweit Feb. 12, 2019, 6:42 p.m. UTC | #8
On 12.02.2019 13:56, Andrew Lunn wrote:
> On Tue, Feb 12, 2019 at 07:51:05AM +0100, Heiner Kallweit wrote:
>> On 12.02.2019 04:58, Andrew Lunn wrote:
>>>>> Hi David
>>>>>
>>>>> I just tested this on one of my boards. It loops endlessly:
>>>>>
>>>>> [   47.173396] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
>>>>> [   47.182108] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
>>>>> [   47.190820] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
>>>>> [   47.199535] mv88e6xxx_g1_irq_thread_work: c881 a8 80                         
>>>>> [   47.208254] mv88e6xxx_g1_irq_thread_work: c881 a8 80   
>>>>>
>>>>> These are reg, ctl1, reg & ctl1.
>>>>>
>>>>> So there is an unhandled device interrupt.
>>>
>>> Hi Heiner
>>>
>>> Your patch Fixes: 2b3e88ea6528 ("net: phy: improve phy state
>>> checking") is causing me problems with interrupts for the Marvell
>>> switches.
>>>
>> Hi Andrew,
>>
>> what kernel version is it?
> 
> It is a little bit old, 5.0-rc1 net-next. I should rebase and
> retest. I'm testing on a ZII board which is not fully in mainline So i
> need some patches.
> 
Thanks, Andrew. Indeed 5.0 needs a fix, as also pointed out by Russell.
I think I will simply remove the following:

if (!phy_is_started(phydev))
	return IRQ_NONE;	

Then we basically do the same like phy_mac_interrupt(), we always run
the state machine. If it has nothing to do, then it does nothing.
Therefore also state HALTED doesn't need a special handling.
This way we handle interrupts (incl. spurious ones) gracefully.

>> And the PHY driver in use is "Marvell 88E6390" ?
> 
> Yes, the marvell 1G driver.
> 
>      Andrew
> .
> 
Heiner
John David Anglin Feb. 12, 2019, 8:09 p.m. UTC | #9
On 2019-02-12 7:56 a.m., Andrew Lunn wrote:
> It is a little bit old, 5.0-rc1 net-next. I should rebase and
> retest. I'm testing on a ZII board which is not fully in mainline So i
> need some patches.
I attempted to test 5.0.0-rc5 on espressobin but it dies on boot:
Starting kernel ...

[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
[    0.000000] Linux version 5.0.0-rc5+ (root@espressobin) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PREEMPT Tue Feb 12
10:02:30 EST 2019
[    0.000000] Machine model: Globalscale Marvell ESPRESSOBin Board
[    0.000000] earlycon: ar3700_uart0 at MMIO 0x00000000d0012000 (options '')
[    0.000000] printk: bootconsole [ar3700_uart0] enabled
[    3.218865] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[    3.224521] Modules linked in:
[    3.227661] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5+ #1
[    3.234107] Hardware name: Globalscale Marvell ESPRESSOBin Board (DT)
[    3.240740] pstate: 20000005 (nzCv daif -PAN -UAO)
[    3.245676] pc : dma_direct_map_page+0x48/0x1d8
[    3.250331] lr : mv_xor_channel_add+0x3b0/0xb28
[    3.254984] sp : ffffff8010033a60
[    3.258389] x29: ffffff8010033a60 x28: ffffffc03bf70480
[    3.263854] x27: ffffff8010e97068 x26: 0000000000000000
[    3.269320] x25: 0000000000000029 x24: 0000000000000083
[    3.274785] x23: 0000000000000000 x22: 0000000000000002
[    3.280251] x21: 0000000000000080 x20: ffffff8010ecd000
[    3.285716] x19: 0000000000000000 x18: ffffffffffffffff
[    3.291182] x17: 000000000000000c x16: 000000000000000a
[    3.296648] x15: ffffff8010ecd6c8 x14: ffffffc03bf46783
[    3.302113] x13: ffffffc03bf46782 x12: 0000000000000038
[    3.307579] x11: 0000000000001fff x10: 0000000000000001
[    3.313044] x9 : 0000000000000000 x8 : ffffff8010dbe000
[    3.318510] x7 : ffffff8010fbe000 x6 : ffffffbf00000000
[    3.323976] x5 : 0000000000000000 x4 : 0000000000000002
[    3.329441] x3 : 0000000000000002 x2 : 00000000000006ac
[    3.334907] x1 : ffffffbf00efdc00 x0 : 0000000000000000
[    3.340373] Process swapper/0 (pid: 1, stack limit = 0x(____ptrval____))
[    3.347272] Call trace:
[    3.349784]  dma_direct_map_page+0x48/0x1d8
[    3.354084]  mv_xor_channel_add+0x3b0/0xb28
[    3.358384]  mv_xor_probe+0x20c/0x4b8
[    3.362150]  platform_drv_probe+0x50/0xb0
[    3.366269]  really_probe+0x1fc/0x2c0
[    3.370032]  driver_probe_device+0x58/0x100
[    3.374332]  __driver_attach+0xd8/0xe0
[    3.378188]  bus_for_each_dev+0x68/0xc8
[    3.382129]  driver_attach+0x20/0x28
[    3.385803]  bus_add_driver+0x108/0x228
[    3.389744]  driver_register+0x60/0x110
[    3.393687]  __platform_driver_register+0x44/0x50
[    3.398529]  mv_xor_driver_init+0x18/0x20
[    3.402648]  do_one_initcall+0x58/0x170
[    3.406591]  kernel_init_freeable+0x190/0x234
[    3.411072]  kernel_init+0x10/0x108
[    3.414653]  ret_from_fork+0x10/0x1c
[    3.418329] Code: 2a0403f6 934cfc00 aa0503f7 7100047f (f9412663)
[    3.424610] ---[ end trace f7751570455a07a0 ]---
[    3.429423] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    3.437232] SMP: stopping secondary CPUs
[    3.441265] Kernel Offset: disabled
[    3.444849] CPU features: 0x002,2000200c
[    3.448878] Memory Limit: none
[    3.452018] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

Dave
Heiner Kallweit Feb. 12, 2019, 8:11 p.m. UTC | #10
On 12.02.2019 17:30, Russell King - ARM Linux admin wrote:
> On Tue, Feb 12, 2019 at 07:51:05AM +0100, Heiner Kallweit wrote:
>> On 12.02.2019 04:58, Andrew Lunn wrote:
>>> That change means we don't check the PHY device if it caused an
>>> interrupt when its state is less than UP.
>>>
>>> What i'm seeing is that the PHY is interrupting pretty early on after
>>> a reboot when the previous boot had the interface up.
>>>
>> So this means that when going down for reboot the interrupts are not
>> properly masked / disabled? Because (at least for net-next) we enable
>> interrupts in phy_start() only.
> 
> Looking at Linus' tree as opposed to net-next, things do look rather
> broken wrt interrupts:
> 
> +-phy_attach_direct
>   `-phydev->state = PHY_READY
> +-phy_prepare_link
> +-phy_start_machine
>   `-phy_trigger_machine()
> `-phy_start_interrupts
>   +-request_threaded_irq()
>   `-phy_enable_interrupts()
>     +-phy_clear_interrupt()
>     `-phy_config_interrupt(, PHY_INTERRUPT_ENABLED)
> 
> At this point, the PHY is then able to generate interrupts, which,
> because phy_start() has not been called and phy_interrupt() checks
> that phydev->state >= PHY_UP, get ignored by the interrupt handler
> exactly as Andrew is finding.
> 
> So it looks like 5.0-rc is already in need of this being fixed.
> 
> In looking at this, I came across this chunk of code:
> 
> static inline bool __phy_is_started(struct phy_device *phydev)
> {
>         WARN_ON(!mutex_is_locked(&phydev->lock));
> 
>         return phydev->state >= PHY_UP;
> }
> 
> /**
>  * phy_is_started - Convenience function to check whether PHY is started
>  * @phydev: The phy_device struct
>  */
> static inline bool phy_is_started(struct phy_device *phydev)
> {
>         bool started;
> 
>         mutex_lock(&phydev->lock);
>         started = __phy_is_started(phydev);
>         mutex_unlock(&phydev->lock);
> 
>         return started;
> }
> 
> which looks to me like over-complication.  The mutex locking there is
> completely pointless - what are you trying to achieve with it?
> 
Even though this code is new it's kind of heritage in phylib that each
access (read or write) to phydev->state is protected by this lock.
I also once wondered whether it's actually needed but didn't spend
effort so far on challenging this. Seems that now the time has come ..

> Let's go through this.  The above is exactly equivalent to:
> 
> bool phy_is_started(phydev)
> {
> 	int state;
> 
> 	mutex_lock(&phydev->lock);
> 	state = phydev->state;
> 	mutex_unlock(&phydev->lock);
> 
> 	return state >= PHY_UP;
> }
> 
> since when we do the test is irrelevant.  Architectures that Linux
> runs on are single-copy atomic, which means that reading phydev->state
> itself is an atomic operation.  So, the mutex locking around that
> doesn't add to the atomicity of the entire operation.
> 
> How, depending on what you do with the rest of this function depends
> whether the entire operation is safe or not.  For example, let's take
> this code at the end of phy_state_machine():
> 
>         if (phy_polling_mode(phydev) && phy_is_started(phydev))
>                 phy_queue_state_machine(phydev, PHY_STATE_TIME);
> 
> state = PHY_UP
> 		thread 0			thread 1
> 						phy_disconnect()
> 						+-phy_is_started()
> 		phy_is_started()                |
> 						`-phy_stop()
> 						  +-phydev->state = PHY_HALTED
> 						  `-phy_stop_machine()
> 						    `-cancel_delayed_work_sync()
> 		phy_queue_state_machine()
> 		`-mod_delayed_work()
> 
Thanks for describing this scenario, I'll have a closer look at it.

> At this point, the phydev->state_queue() has been added back onto the
> system workqueue despite phy_stop_machine() having been called and
> cancel_delayed_work_sync() called on it.
> 
> The original code in 4.20 did not have this race condition.
> 
> Basically, the lock inside phy_is_started() does nothing useful, and
> I'd say is dangerously misleading.
> 

Heiner
Heiner Kallweit Feb. 12, 2019, 8:54 p.m. UTC | #11
On 12.02.2019 17:30, Russell King - ARM Linux admin wrote:
> On Tue, Feb 12, 2019 at 07:51:05AM +0100, Heiner Kallweit wrote:
>> On 12.02.2019 04:58, Andrew Lunn wrote:
>>> That change means we don't check the PHY device if it caused an
>>> interrupt when its state is less than UP.
>>>
>>> What i'm seeing is that the PHY is interrupting pretty early on after
>>> a reboot when the previous boot had the interface up.
>>>
>> So this means that when going down for reboot the interrupts are not
>> properly masked / disabled? Because (at least for net-next) we enable
>> interrupts in phy_start() only.
> 
[..]
> In looking at this, I came across this chunk of code:
> 
> static inline bool __phy_is_started(struct phy_device *phydev)
> {
>         WARN_ON(!mutex_is_locked(&phydev->lock));
> 
>         return phydev->state >= PHY_UP;
> }
> 
> /**
>  * phy_is_started - Convenience function to check whether PHY is started
>  * @phydev: The phy_device struct
>  */
> static inline bool phy_is_started(struct phy_device *phydev)
> {
>         bool started;
> 
>         mutex_lock(&phydev->lock);
>         started = __phy_is_started(phydev);
>         mutex_unlock(&phydev->lock);
> 
>         return started;
> }
> 
> which looks to me like over-complication.  The mutex locking there is
> completely pointless - what are you trying to achieve with it?
> 
> Let's go through this.  The above is exactly equivalent to:
> 
> bool phy_is_started(phydev)
> {
> 	int state;
> 
> 	mutex_lock(&phydev->lock);
> 	state = phydev->state;
> 	mutex_unlock(&phydev->lock);
> 
> 	return state >= PHY_UP;
> }
> 
> since when we do the test is irrelevant.  Architectures that Linux
> runs on are single-copy atomic, which means that reading phydev->state
> itself is an atomic operation.  So, the mutex locking around that
> doesn't add to the atomicity of the entire operation.
> 
> How, depending on what you do with the rest of this function depends
> whether the entire operation is safe or not.  For example, let's take
> this code at the end of phy_state_machine():
> 
>         if (phy_polling_mode(phydev) && phy_is_started(phydev))
>                 phy_queue_state_machine(phydev, PHY_STATE_TIME);
> 
> state = PHY_UP
> 		thread 0			thread 1
> 						phy_disconnect()
> 						+-phy_is_started()
> 		phy_is_started()                |
> 						`-phy_stop()
> 						  +-phydev->state = PHY_HALTED
> 						  `-phy_stop_machine()
> 						    `-cancel_delayed_work_sync()
> 		phy_queue_state_machine()
> 		`-mod_delayed_work()
> 
> At this point, the phydev->state_queue() has been added back onto the
> system workqueue despite phy_stop_machine() having been called and
> cancel_delayed_work_sync() called on it.
> 
> The original code in 4.20 did not have this race condition.
> 
> Basically, the lock inside phy_is_started() does nothing useful, and
> I'd say is dangerously misleading.
> 
Then idea would be to first remove the locking from phy_is_started()
and in a second step do the following to prevent the described race
(phy_stop() takes phydev->lock too).

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index c1ed03800..69dc64a4d 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -957,8 +957,10 @@ void phy_state_machine(struct work_struct *work)
         * state machine would be pointless and possibly error prone when
         * called from phy_disconnect() synchronously.
         */
+       mutex_lock(&phydev->lock);
        if (phy_polling_mode(phydev) && phy_is_started(phydev))
                phy_queue_state_machine(phydev, PHY_STATE_TIME);
+       mutex_unlock(&phydev->lock);
 }

Heiner
Russell King (Oracle) Feb. 12, 2019, 10:55 p.m. UTC | #12
On Tue, Feb 12, 2019 at 09:54:55PM +0100, Heiner Kallweit wrote:
> On 12.02.2019 17:30, Russell King - ARM Linux admin wrote:
> > On Tue, Feb 12, 2019 at 07:51:05AM +0100, Heiner Kallweit wrote:
> >> On 12.02.2019 04:58, Andrew Lunn wrote:
> >>> That change means we don't check the PHY device if it caused an
> >>> interrupt when its state is less than UP.
> >>>
> >>> What i'm seeing is that the PHY is interrupting pretty early on after
> >>> a reboot when the previous boot had the interface up.
> >>>
> >> So this means that when going down for reboot the interrupts are not
> >> properly masked / disabled? Because (at least for net-next) we enable
> >> interrupts in phy_start() only.
> > 
> [..]
> > In looking at this, I came across this chunk of code:
> > 
> > static inline bool __phy_is_started(struct phy_device *phydev)
> > {
> >         WARN_ON(!mutex_is_locked(&phydev->lock));
> > 
> >         return phydev->state >= PHY_UP;
> > }
> > 
> > /**
> >  * phy_is_started - Convenience function to check whether PHY is started
> >  * @phydev: The phy_device struct
> >  */
> > static inline bool phy_is_started(struct phy_device *phydev)
> > {
> >         bool started;
> > 
> >         mutex_lock(&phydev->lock);
> >         started = __phy_is_started(phydev);
> >         mutex_unlock(&phydev->lock);
> > 
> >         return started;
> > }
> > 
> > which looks to me like over-complication.  The mutex locking there is
> > completely pointless - what are you trying to achieve with it?
> > 
> > Let's go through this.  The above is exactly equivalent to:
> > 
> > bool phy_is_started(phydev)
> > {
> > 	int state;
> > 
> > 	mutex_lock(&phydev->lock);
> > 	state = phydev->state;
> > 	mutex_unlock(&phydev->lock);
> > 
> > 	return state >= PHY_UP;
> > }
> > 
> > since when we do the test is irrelevant.  Architectures that Linux
> > runs on are single-copy atomic, which means that reading phydev->state
> > itself is an atomic operation.  So, the mutex locking around that
> > doesn't add to the atomicity of the entire operation.
> > 
> > How, depending on what you do with the rest of this function depends
> > whether the entire operation is safe or not.  For example, let's take
> > this code at the end of phy_state_machine():
> > 
> >         if (phy_polling_mode(phydev) && phy_is_started(phydev))
> >                 phy_queue_state_machine(phydev, PHY_STATE_TIME);
> > 
> > state = PHY_UP
> > 		thread 0			thread 1
> > 						phy_disconnect()
> > 						+-phy_is_started()
> > 		phy_is_started()                |
> > 						`-phy_stop()
> > 						  +-phydev->state = PHY_HALTED
> > 						  `-phy_stop_machine()
> > 						    `-cancel_delayed_work_sync()
> > 		phy_queue_state_machine()
> > 		`-mod_delayed_work()
> > 
> > At this point, the phydev->state_queue() has been added back onto the
> > system workqueue despite phy_stop_machine() having been called and
> > cancel_delayed_work_sync() called on it.
> > 
> > The original code in 4.20 did not have this race condition.
> > 
> > Basically, the lock inside phy_is_started() does nothing useful, and
> > I'd say is dangerously misleading.
> > 
> Then idea would be to first remove the locking from phy_is_started()
> and in a second step do the following to prevent the described race
> (phy_stop() takes phydev->lock too).
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index c1ed03800..69dc64a4d 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -957,8 +957,10 @@ void phy_state_machine(struct work_struct *work)
>          * state machine would be pointless and possibly error prone when
>          * called from phy_disconnect() synchronously.
>          */
> +       mutex_lock(&phydev->lock);
>         if (phy_polling_mode(phydev) && phy_is_started(phydev))
>                 phy_queue_state_machine(phydev, PHY_STATE_TIME);
> +       mutex_unlock(&phydev->lock);
>  }

Yep, that approach would certainly be better.  I didn't exhaustively
audit the 5.0-rc code though.
Andrew Lunn Feb. 14, 2019, 2:07 a.m. UTC | #13
On Mon, Feb 11, 2019 at 01:40:21PM -0500, John David Anglin wrote:
> The GPIO interrupt controller on the espressobin board only supports edge interrupts.
> If one enables the use of hardware interrupts in the device tree for the 88E6341, it is
> possible to miss an edge.  When this happens, the INTn pin on the Marvell switch is
> stuck low and no further interrupts occur.
> 
> I found after adding debug statements to mv88e6xxx_g1_irq_thread_work() that there is
> a race in handling device interrupts (e.g. PHY link interrupts).  Some interrupts are
> directly cleared by reading the Global 1 status register.  However, the device interrupt
> flag, for example, is not cleared until all the unmasked SERDES and PHY ports are serviced.
> This is done by reading the relevant SERDES and PHY status register.
> 
> The code only services interrupts whose status bit is set at the time of reading its status
> register.  If an interrupt event occurs after its status is read and before all interrupts
> are serviced, then this event will not be serviced and the INTn output pin will remain low.
> 
> This is not a problem with polling or level interrupts since the handler will be called
> again to process the event.  However, it's a big problem when using level interrupts.
> 
> The fix presented here is to add a loop around the code servicing switch interrupts.  If
> any pending interrupts remain after the current set has been handled, we loop and process
> the new set.  If there are no pending interrupts after servicing, we are sure that INTn has
> gone high and we will get an edge when a new event occurs.
> 
> Tested on espressobin board.
> 
> Signed-off-by:  John David Anglin <dave.anglin@bell.net>

Fixes: dc30c35be720 ("net: dsa: mv88e6xxx: Implement interrupt support.")

Tested-by: Andrew Lunn <andrew@lunn.ch>

David, please ensure that Heiner's patch:

net: phy: fix interrupt handling in non-started states

is applied first. Otherwise we can get into an interrupt storm.

    Andrew
David Miller Feb. 14, 2019, 4:47 a.m. UTC | #14
From: Andrew Lunn <andrew@lunn.ch>
Date: Thu, 14 Feb 2019 03:07:23 +0100

> On Mon, Feb 11, 2019 at 01:40:21PM -0500, John David Anglin wrote:
>> The GPIO interrupt controller on the espressobin board only supports edge interrupts.
>> If one enables the use of hardware interrupts in the device tree for the 88E6341, it is
>> possible to miss an edge.  When this happens, the INTn pin on the Marvell switch is
>> stuck low and no further interrupts occur.
>> 
>> I found after adding debug statements to mv88e6xxx_g1_irq_thread_work() that there is
>> a race in handling device interrupts (e.g. PHY link interrupts).  Some interrupts are
>> directly cleared by reading the Global 1 status register.  However, the device interrupt
>> flag, for example, is not cleared until all the unmasked SERDES and PHY ports are serviced.
>> This is done by reading the relevant SERDES and PHY status register.
>> 
>> The code only services interrupts whose status bit is set at the time of reading its status
>> register.  If an interrupt event occurs after its status is read and before all interrupts
>> are serviced, then this event will not be serviced and the INTn output pin will remain low.
>> 
>> This is not a problem with polling or level interrupts since the handler will be called
>> again to process the event.  However, it's a big problem when using level interrupts.
>> 
>> The fix presented here is to add a loop around the code servicing switch interrupts.  If
>> any pending interrupts remain after the current set has been handled, we loop and process
>> the new set.  If there are no pending interrupts after servicing, we are sure that INTn has
>> gone high and we will get an edge when a new event occurs.
>> 
>> Tested on espressobin board.
>> 
>> Signed-off-by:  John David Anglin <dave.anglin@bell.net>
> 
> Fixes: dc30c35be720 ("net: dsa: mv88e6xxx: Implement interrupt support.")
> 
> Tested-by: Andrew Lunn <andrew@lunn.ch>
> 
> David, please ensure that Heiner's patch:
> 
> net: phy: fix interrupt handling in non-started states
> 
> is applied first. Otherwise we can get into an interrupt storm.

Ok, all done.

Should I queue just this one for -stable?  I didn't queue up Heiner's change for
-stable because it fixes a 5.0-rcX regression.
Andrew Lunn Feb. 14, 2019, 4:50 a.m. UTC | #15
> Ok, all done.

Thanks

> Should I queue just this one for -stable?  I didn't queue up Heiner's change for
> -stable because it fixes a 5.0-rcX regression.

Yes please.

    Andrew
David Miller Feb. 14, 2019, 3:27 p.m. UTC | #16
From: Andrew Lunn <andrew@lunn.ch>
Date: Thu, 14 Feb 2019 05:50:19 +0100

>> Should I queue just this one for -stable?  I didn't queue up Heiner's change for
>> -stable because it fixes a 5.0-rcX regression.
> 
> Yes please.

Done.
diff mbox series

Patch

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 8dca2c949e73..12fd7ce3f1ff 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -261,6 +261,7 @@  static irqreturn_t mv88e6xxx_g1_irq_thread_work(struct mv88e6xxx_chip *chip)
 	unsigned int sub_irq;
 	unsigned int n;
 	u16 reg;
+	u16 ctl1;
 	int err;

 	mutex_lock(&chip->reg_lock);
@@ -270,13 +271,28 @@  static irqreturn_t mv88e6xxx_g1_irq_thread_work(struct mv88e6xxx_chip *chip)
 	if (err)
 		goto out;

-	for (n = 0; n < chip->g1_irq.nirqs; ++n) {
-		if (reg & (1 << n)) {
-			sub_irq = irq_find_mapping(chip->g1_irq.domain, n);
-			handle_nested_irq(sub_irq);
-			++nhandled;
+	do {
+		for (n = 0; n < chip->g1_irq.nirqs; ++n) {
+			if (reg & (1 << n)) {
+				sub_irq = irq_find_mapping(chip->g1_irq.domain,
+							   n);
+				handle_nested_irq(sub_irq);
+				++nhandled;
+			}
 		}
-	}
+
+		mutex_lock(&chip->reg_lock);
+		err = mv88e6xxx_g1_read(chip, MV88E6XXX_G1_CTL1, &ctl1);
+		if (err)
+			goto unlock;
+		err = mv88e6xxx_g1_read(chip, MV88E6XXX_G1_STS, &reg);
+unlock:
+		mutex_unlock(&chip->reg_lock);
+		if (err)
+			goto out;
+		ctl1 &= GENMASK(chip->g1_irq.nirqs, 0);
+	} while (reg & ctl1);
+
 out:
 	return (nhandled > 0 ? IRQ_HANDLED : IRQ_NONE);
 }