diff mbox

Regression: [PATCH] mlx4: give precise rx/tx bytes/packets counters

Message ID 1480539652.18162.205.camel@edumazet-glaptop3.roam.corp.google.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Nov. 30, 2016, 9 p.m. UTC
On Wed, 2016-11-30 at 22:42 +0200, Saeed Mahameed wrote:
> On Wed, Nov 30, 2016 at 7:35 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2016-11-30 at 18:46 +0200, Saeed Mahameed wrote:
> >
> >> we had/still have the proper stats they are the ones that
> >> mlx4_en_fold_software_stats is trying to cache into  (they always
> >> exist),
> >> but the ones that you are trying to read from (the mlx4 rings) are gone !
> >>
> >> This bug is totally new and as i warned, this is another symptom of
> >> the real root cause (can't sleep while reading stats).
> >>
> >> Eric what do you suggest ? Keep pre-allocated MAX_RINGS stats  and
> >> always iterate over all of them to query stats ?
> >> what if you have one ring/none/1K ? how would you know how many to query ?
> >
> > I am suggesting I will fix the bug I introduced.
> >
> > Do not panic.
> >
> >
> 
> Not at all, I trust you are the only one who is capable of providing
> the best solution.
> I am just trying to read your mind :-).
> 
> As i said i like the solution and i want to adapt it to mlx5, so I am
> a little bit enthusiastic :)

What about the following fix guys ?

As a bonus we update the stats right before they are sent to monitors
via rtnetlink ;)

Comments

Jesper Dangaard Brouer Dec. 1, 2016, 12:37 p.m. UTC | #1
On Wed, 30 Nov 2016 13:00:52 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2016-11-30 at 22:42 +0200, Saeed Mahameed wrote:
> > On Wed, Nov 30, 2016 at 7:35 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:  
> > > On Wed, 2016-11-30 at 18:46 +0200, Saeed Mahameed wrote:
[...]
> > >
> > > I am suggesting I will fix the bug I introduced.
> > >
> > > Do not panic.
> > >
> > >  
> > 
> > Not at all, I trust you are the only one who is capable of providing
> > the best solution.
> > I am just trying to read your mind :-).
> > 
> > As i said i like the solution and i want to adapt it to mlx5, so I am
> > a little bit enthusiastic :)  
> 
> What about the following fix guys ?

Confirming this fixed the crash on shutdown for me, thanks!

 
> As a bonus we update the stats right before they are sent to monitors
> via rtnetlink ;)
> 
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 12ea3405f442717478bf0e8882edaf0de77986cb..091b904262bc7932d3edf99cf850affb23b9ce6e 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -1809,8 +1809,12 @@ void mlx4_en_stop_port(struct net_device *dev, int detach)
>  
>  	netif_tx_disable(dev);
>  
> +	spin_lock_bh(&priv->stats_lock);
> +	mlx4_en_fold_software_stats(dev);
>  	/* Set port as not active */
>  	priv->port_up = false;
> +	spin_unlock_bh(&priv->stats_lock);
> +
>  	priv->counter_index = MLX4_SINK_COUNTER_INDEX(mdev->dev);
>  
>  	/* Promsicuous mode */
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c
> index c6c4f1238923e09eced547454b86c68720292859..9166d90e732858610b1407fe85cbf6cbe27f5e0b 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
> @@ -154,7 +154,7 @@ void mlx4_en_fold_software_stats(struct net_device *dev)
>  	unsigned long packets, bytes;
>  	int i;
>  
> -	if (mlx4_is_master(mdev->dev))
> +	if (!priv->port_up || mlx4_is_master(mdev->dev))
>  		return;
>  
>  	packets = 0;
> 
>
Saeed Mahameed Dec. 1, 2016, 3:38 p.m. UTC | #2
On Wed, Nov 30, 2016 at 11:00 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 22:42 +0200, Saeed Mahameed wrote:
>> On Wed, Nov 30, 2016 at 7:35 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Wed, 2016-11-30 at 18:46 +0200, Saeed Mahameed wrote:
>> >
>> >> we had/still have the proper stats they are the ones that
>> >> mlx4_en_fold_software_stats is trying to cache into  (they always
>> >> exist),
>> >> but the ones that you are trying to read from (the mlx4 rings) are gone !
>> >>
>> >> This bug is totally new and as i warned, this is another symptom of
>> >> the real root cause (can't sleep while reading stats).
>> >>
>> >> Eric what do you suggest ? Keep pre-allocated MAX_RINGS stats  and
>> >> always iterate over all of them to query stats ?
>> >> what if you have one ring/none/1K ? how would you know how many to query ?
>> >
>> > I am suggesting I will fix the bug I introduced.
>> >
>> > Do not panic.
>> >
>> >
>>
>> Not at all, I trust you are the only one who is capable of providing
>> the best solution.
>> I am just trying to read your mind :-).
>>
>> As i said i like the solution and i want to adapt it to mlx5, so I am
>> a little bit enthusiastic :)
>
> What about the following fix guys ?
>
> As a bonus we update the stats right before they are sent to monitors
> via rtnetlink ;)

Hi Eric, Thanks for the patch, I already acked it.

I have one educational question (not related to this patch, but
related to stats reading in general).
I was wondering why do we need to disable bh every time we read stats
"spin_lock_bh" ? is it essential ?

I checked and in mlx4 we don't hold stats_lock in softirq
(en_rx.c/en_tx.c), so I don't see any deadlock risk in here..

 Thanks
Saeed.
Eric Dumazet Dec. 1, 2016, 3:55 p.m. UTC | #3
On Thu, 2016-12-01 at 17:38 +0200, Saeed Mahameed wrote:

> 
> Hi Eric, Thanks for the patch, I already acked it.

Thanks !

> 
> I have one educational question (not related to this patch, but
> related to stats reading in general).
> I was wondering why do we need to disable bh every time we read stats
> "spin_lock_bh" ? is it essential ?
> 
> I checked and in mlx4 we don't hold stats_lock in softirq
> (en_rx.c/en_tx.c), so I don't see any deadlock risk in here..

Excellent question, and I chose to keep the spinlock.

That would be doable, only if we do not overwrite dev->stats.

Current code is :

static struct rtnl_link_stats64 *
mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
{
        struct mlx4_en_priv *priv = netdev_priv(dev);

        spin_lock_bh(&priv->stats_lock);
        mlx4_en_fold_software_stats(dev);
        netdev_stats_to_stats64(stats, &dev->stats);
        spin_unlock_bh(&priv->stats_lock);

        return stats;
}

If you remove the spin_lock_bh() :

static struct rtnl_link_stats64 *
mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
{
        struct mlx4_en_priv *priv = netdev_priv(dev);

        mlx4_en_fold_software_stats(dev); // possible races

        netdev_stats_to_stats64(stats, &dev->stats);

        return stats;
}

1) one mlx4_en_fold_software_stats(dev) could be preempted
on a CONFIG_PREEMPT kernel, or interrupted by long irqs.

2) Another cpu would also call mlx4_en_fold_software_stats(dev) while
   first cpu is busy.

3) Then when resuming first cpu/thread, part of the dev->stats fieds 
would be updated with 'old counters',
while another thread might have updated them with newer values.

4) A SNMP reader could then get counters that are not monotonically
increasing,
which would be confusing/buggy.

So removing the spinlock is doable, but needs to add a new parameter
to mlx4_en_fold_software_stats() and call netdev_stats_to_stats64()
before mlx4_en_fold_software_stats(dev)

static struct rtnl_link_stats64 *
mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
{
        struct mlx4_en_priv *priv = netdev_priv(dev);

        netdev_stats_to_stats64(stats, &dev->stats);

	// Passing a non NULL stats asks mlx4_en_fold_software_stats()
	// to not update dev->stats, but stats directly.

        mlx4_en_fold_software_stats(dev, stats)


        return stats;
}
Saeed Mahameed Dec. 1, 2016, 4:33 p.m. UTC | #4
On Thu, Dec 1, 2016 at 5:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-12-01 at 17:38 +0200, Saeed Mahameed wrote:
>
>>
>> Hi Eric, Thanks for the patch, I already acked it.
>
> Thanks !
>
>>
>> I have one educational question (not related to this patch, but
>> related to stats reading in general).
>> I was wondering why do we need to disable bh every time we read stats
>> "spin_lock_bh" ? is it essential ?
>>
>> I checked and in mlx4 we don't hold stats_lock in softirq
>> (en_rx.c/en_tx.c), so I don't see any deadlock risk in here..
>
> Excellent question, and I chose to keep the spinlock.
>
> That would be doable, only if we do not overwrite dev->stats.
>
> Current code is :
>
> static struct rtnl_link_stats64 *
> mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
> {
>         struct mlx4_en_priv *priv = netdev_priv(dev);
>
>         spin_lock_bh(&priv->stats_lock);
>         mlx4_en_fold_software_stats(dev);
>         netdev_stats_to_stats64(stats, &dev->stats);
>         spin_unlock_bh(&priv->stats_lock);
>
>         return stats;
> }
>
> If you remove the spin_lock_bh() :
>
>
> static struct rtnl_link_stats64 *
> mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
> {
>         struct mlx4_en_priv *priv = netdev_priv(dev);
>
>         mlx4_en_fold_software_stats(dev); // possible races
>
>         netdev_stats_to_stats64(stats, &dev->stats);
>
>         return stats;
> }
>
> 1) one mlx4_en_fold_software_stats(dev) could be preempted
> on a CONFIG_PREEMPT kernel, or interrupted by long irqs.
>
> 2) Another cpu would also call mlx4_en_fold_software_stats(dev) while
>    first cpu is busy.
>
> 3) Then when resuming first cpu/thread, part of the dev->stats fieds
> would be updated with 'old counters',
> while another thread might have updated them with newer values.
>
> 4) A SNMP reader could then get counters that are not monotonically
> increasing,
> which would be confusing/buggy.
>
> So removing the spinlock is doable, but needs to add a new parameter
> to mlx4_en_fold_software_stats() and call netdev_stats_to_stats64()
> before mlx4_en_fold_software_stats(dev)
>
> static struct rtnl_link_stats64 *
> mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
> {
>         struct mlx4_en_priv *priv = netdev_priv(dev);
>
>         netdev_stats_to_stats64(stats, &dev->stats);
>
>         // Passing a non NULL stats asks mlx4_en_fold_software_stats()
>         // to not update dev->stats, but stats directly.
>
>         mlx4_en_fold_software_stats(dev, stats)
>
>
>         return stats;
> }
>
>

Thanks for the detailed answer !!

BTW you went 5 steps ahead of my original question :)), so far you
already have a patch without locking at all (really impressive).

What i wanted to ask originally, was regarding the "_bh", i didn't
mean to completely remove the "spin_lock_bh",
I meant, what happens if we replace "spin_lock_bh"  with "spin_lock",
without disabling bh ?
I gues raw "sping_lock" handles points (2 to 4) from above, but it
won't handle long irqs.
Eric Dumazet Dec. 1, 2016, 5:08 p.m. UTC | #5
On Thu, 2016-12-01 at 18:33 +0200, Saeed Mahameed wrote:

> Thanks for the detailed answer !!

You're welcome.

> 
> BTW you went 5 steps ahead of my original question :)), so far you
> already have a patch without locking at all (really impressive).
> 
> What i wanted to ask originally, was regarding the "_bh", i didn't
> mean to completely remove the "spin_lock_bh",
> I meant, what happens if we replace "spin_lock_bh"  with "spin_lock",
> without disabling bh ?
> I gues raw "sping_lock" handles points (2 to 4) from above, but it
> won't handle long irqs.

Thats a very good point, the _bh prefix can totally be removed, since
stats_lock is only acquired from process context.
Saeed Mahameed Dec. 4, 2016, 1:21 p.m. UTC | #6
On Thu, Dec 1, 2016 at 7:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-12-01 at 18:33 +0200, Saeed Mahameed wrote:
>
>> Thanks for the detailed answer !!
>
> You're welcome.
>
>>
>> BTW you went 5 steps ahead of my original question :)), so far you
>> already have a patch without locking at all (really impressive).
>>
>> What i wanted to ask originally, was regarding the "_bh", i didn't
>> mean to completely remove the "spin_lock_bh",
>> I meant, what happens if we replace "spin_lock_bh"  with "spin_lock",
>> without disabling bh ?
>> I gues raw "sping_lock" handles points (2 to 4) from above, but it
>> won't handle long irqs.
>
> Thats a very good point, the _bh prefix can totally be removed, since
> stats_lock is only acquired from process context.
>
>

That was my initial point, Thanks for the help.
will provide a fix patch later once 4.9 is release.
diff mbox

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 12ea3405f442717478bf0e8882edaf0de77986cb..091b904262bc7932d3edf99cf850affb23b9ce6e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1809,8 +1809,12 @@  void mlx4_en_stop_port(struct net_device *dev, int detach)
 
 	netif_tx_disable(dev);
 
+	spin_lock_bh(&priv->stats_lock);
+	mlx4_en_fold_software_stats(dev);
 	/* Set port as not active */
 	priv->port_up = false;
+	spin_unlock_bh(&priv->stats_lock);
+
 	priv->counter_index = MLX4_SINK_COUNTER_INDEX(mdev->dev);
 
 	/* Promsicuous mode */
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c
index c6c4f1238923e09eced547454b86c68720292859..9166d90e732858610b1407fe85cbf6cbe27f5e0b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
@@ -154,7 +154,7 @@  void mlx4_en_fold_software_stats(struct net_device *dev)
 	unsigned long packets, bytes;
 	int i;
 
-	if (mlx4_is_master(mdev->dev))
+	if (!priv->port_up || mlx4_is_master(mdev->dev))
 		return;
 
 	packets = 0;