diff mbox

[net,4/6] net/mlx4_core: Do not BUG_ON during reset when PCI is offline

Message ID 1455634911-31206-5-git-send-email-ogerlitz@mellanox.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Or Gerlitz Feb. 16, 2016, 3:01 p.m. UTC
From: Daniel Jurgens <danielj@mellanox.com>

The PCI channel could go offline during reset due to EEH.  Don't bug on in
this case, the error is recoverable.

Fixes: f6bc11e42646 ('net/mlx4_core: Enhance the catas flow to support device reset')
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/catas.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Comments

Sergei Shtylyov Feb. 16, 2016, 6:30 p.m. UTC | #1
Hello.

On 02/16/2016 06:01 PM, Or Gerlitz wrote:

> From: Daniel Jurgens <danielj@mellanox.com>
>
> The PCI channel could go offline during reset due to EEH.  Don't bug on in
> this case, the error is recoverable.
>
> Fixes: f6bc11e42646 ('net/mlx4_core: Enhance the catas flow to support device reset')
> Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
> Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/catas.c | 11 +++++++++--
>   1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/catas.c b/drivers/net/ethernet/mellanox/mlx4/catas.c
> index 715de8a..e866082 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/catas.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/catas.c
> @@ -182,10 +182,17 @@ void mlx4_enter_error_state(struct mlx4_dev_persistent *persist)
>   		err = mlx4_reset_slave(dev);
>   	else
>   		err = mlx4_reset_master(dev);
> -	BUG_ON(err != 0);
> +
> +	if (!err)
> +		mlx4_err(dev, "device was reset successfully\n");
> +	else
> +		/* EEH could have disabled the PCI channel during reset. That's
> +		 * recoverable and the PCI error flow will handle it.
> +		 */
> +		if (!pci_channel_offline(dev->persist->pdev))
> +			BUG_ON(1);

    I'm afraid this needs {}.

[...]

MBR, Sergei
Sergei Shtylyov Feb. 16, 2016, 9:21 p.m. UTC | #2
Hello.

On 02/16/2016 06:01 PM, Or Gerlitz wrote:

> From: Daniel Jurgens <danielj@mellanox.com>
>
> The PCI channel could go offline during reset due to EEH.  Don't bug on in
> this case, the error is recoverable.
>
> Fixes: f6bc11e42646 ('net/mlx4_core: Enhance the catas flow to support device reset')
> Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
> Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/catas.c | 11 +++++++++--
>   1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/catas.c b/drivers/net/ethernet/mellanox/mlx4/catas.c
> index 715de8a..e866082 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/catas.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/catas.c
> @@ -182,10 +182,17 @@ void mlx4_enter_error_state(struct mlx4_dev_persistent *persist)
>   		err = mlx4_reset_slave(dev);
>   	else
>   		err = mlx4_reset_master(dev);
> -	BUG_ON(err != 0);
> +

    Just noticed: the empty line here isn't really needed.

> +	if (!err)
> +		mlx4_err(dev, "device was reset successfully\n");
> +	else
> +		/* EEH could have disabled the PCI channel during reset. That's
> +		 * recoverable and the PCI error flow will handle it.
> +		 */
> +		if (!pci_channel_offline(dev->persist->pdev))
> +			BUG_ON(1);
[...]

MBR, Sergei
Or Gerlitz Feb. 17, 2016, 9:21 a.m. UTC | #3
On 2/16/2016 8:30 PM, Sergei Shtylyov wrote:
>> --- a/drivers/net/ethernet/mellanox/mlx4/catas.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/catas.c
>> @@ -182,10 +182,17 @@ void mlx4_enter_error_state(struct 
>> mlx4_dev_persistent *persist)
>>           err = mlx4_reset_slave(dev);
>>       else
>>           err = mlx4_reset_master(dev);
>> -    BUG_ON(err != 0);
>> +
>> +    if (!err)
>> +        mlx4_err(dev, "device was reset successfully\n");
>> +    else
>> +        /* EEH could have disabled the PCI channel during reset. That's
>> +         * recoverable and the PCI error flow will handle it.
>> +         */
>> +        if (!pci_channel_offline(dev->persist->pdev))
>> +            BUG_ON(1);
>
>    I'm afraid this needs {}. 

Hey, don't be afraid just for that, stay cool... we can add that here if 
it helps, as for the blank line not deletedby this patch on which you 
commented later, will not remove it as part of a patch which is a strict 
fix.

Or.
Sergei Shtylyov Feb. 17, 2016, 10:53 a.m. UTC | #4
Hello.

On 2/17/2016 12:21 PM, Or Gerlitz wrote:

>>> --- a/drivers/net/ethernet/mellanox/mlx4/catas.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/catas.c
>>> @@ -182,10 +182,17 @@ void mlx4_enter_error_state(struct
>>> mlx4_dev_persistent *persist)
>>>           err = mlx4_reset_slave(dev);
>>>       else
>>>           err = mlx4_reset_master(dev);
>>> -    BUG_ON(err != 0);
>>> +
>>> +    if (!err)
>>> +        mlx4_err(dev, "device was reset successfully\n");
>>> +    else
>>> +        /* EEH could have disabled the PCI channel during reset. That's
>>> +         * recoverable and the PCI error flow will handle it.
>>> +         */
>>> +        if (!pci_channel_offline(dev->persist->pdev))
>>> +            BUG_ON(1);
>>
>>    I'm afraid this needs {}.
>
> Hey, don't be afraid just for that, stay cool... we can add that here if it
> helps, as for the blank line not deletedby this patch on which you commented
> later, will not remove it as part of a patch which is a strict fix.

    Actually it's *added* by the patch, that's why I asked to not do that.

> Or.

MBR, Sergei
Sergei Shtylyov Feb. 17, 2016, 6:50 p.m. UTC | #5
Hello.

On 02/16/2016 06:01 PM, Or Gerlitz wrote:

> From: Daniel Jurgens <danielj@mellanox.com>
>
> The PCI channel could go offline during reset due to EEH.  Don't bug on in
> this case, the error is recoverable.
>
> Fixes: f6bc11e42646 ('net/mlx4_core: Enhance the catas flow to support device reset')
> Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
> Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/catas.c | 11 +++++++++--
>   1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/catas.c b/drivers/net/ethernet/mellanox/mlx4/catas.c
> index 715de8a..e866082 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/catas.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/catas.c
> @@ -182,10 +182,17 @@ void mlx4_enter_error_state(struct mlx4_dev_persistent *persist)
>   		err = mlx4_reset_slave(dev);
>   	else
>   		err = mlx4_reset_master(dev);
> -	BUG_ON(err != 0);
> +
> +	if (!err)
> +		mlx4_err(dev, "device was reset successfully\n");
> +	else
> +		/* EEH could have disabled the PCI channel during reset. That's
> +		 * recoverable and the PCI error flow will handle it.
> +		 */
> +		if (!pci_channel_offline(dev->persist->pdev))
> +			BUG_ON(1);

    BTW, why not just BUG_ON(!pci_channel_offline(dev->persist->pdev));

WBR, Sergei
diff mbox

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx4/catas.c b/drivers/net/ethernet/mellanox/mlx4/catas.c
index 715de8a..e866082 100644
--- a/drivers/net/ethernet/mellanox/mlx4/catas.c
+++ b/drivers/net/ethernet/mellanox/mlx4/catas.c
@@ -182,10 +182,17 @@  void mlx4_enter_error_state(struct mlx4_dev_persistent *persist)
 		err = mlx4_reset_slave(dev);
 	else
 		err = mlx4_reset_master(dev);
-	BUG_ON(err != 0);
+
+	if (!err)
+		mlx4_err(dev, "device was reset successfully\n");
+	else
+		/* EEH could have disabled the PCI channel during reset. That's
+		 * recoverable and the PCI error flow will handle it.
+		 */
+		if (!pci_channel_offline(dev->persist->pdev))
+			BUG_ON(1);
 
 	dev->persist->state |= MLX4_DEVICE_STATE_INTERNAL_ERROR;
-	mlx4_err(dev, "device was reset successfully\n");
 	mutex_unlock(&persist->device_state_mutex);
 
 	/* At that step HW was already reset, now notify clients */