
[RFC,net-next,1/2] net: net-procfs: Reduce rcu lock critical section

Message ID 20180410170812.18905-1-saeedm@mellanox.com
State RFC, archived
Delegated to: David Miller
Series [RFC,net-next,1/2] net: net-procfs: Reduce rcu lock critical section

Commit Message

Saeed Mahameed April 10, 2018, 5:08 p.m. UTC
The current net procfs sequence file implementation, which shows the
current namespace's netdev statistics and mc lists, holds the RCU lock
throughout the whole process, from dev seq start up to dev seq stop.

This is really greedy and demanding from device drivers, since
ndo_get_stats64 is called from dev_seq_show while the RCU lock is held.

The RCU lock is needed to guarantee that the device chain is not
modified while the dev sequence file is walking through it and handling
a netdev at the same time.

To minimize this critical section and drastically reduce the time the
RCU lock is held, all we need is to grab the RCU lock only for the
brief moment where we look for the next netdev to handle; if one is
found, dev_hold it to guarantee it is kept alive while it is accessed
later in the seq show callback, and release the RCU lock immediately.

The current netdev being handled is released (dev_put) when the seq
next callback or dev seq stop is called.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 net/core/net-procfs.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

Comments

David Miller April 10, 2018, 5:16 p.m. UTC | #1
From: Saeed Mahameed <saeedm@mellanox.com>
Date: Tue, 10 Apr 2018 10:08:11 -0700

> The current net procfs sequence file implementation, which shows the
> current namespace's netdev statistics and mc lists, holds the RCU lock
> throughout the whole process, from dev seq start up to dev seq stop.
> 
> This is really greedy and demanding from device drivers, since
> ndo_get_stats64 is called from dev_seq_show while the RCU lock is held.
> 
> The RCU lock is needed to guarantee that the device chain is not
> modified while the dev sequence file is walking through it and handling
> a netdev at the same time.
> 
> To minimize this critical section and drastically reduce the time the
> RCU lock is held, all we need is to grab the RCU lock only for the
> brief moment where we look for the next netdev to handle; if one is
> found, dev_hold it to guarantee it is kept alive while it is accessed
> later in the seq show callback, and release the RCU lock immediately.
> 
> The current netdev being handled is released (dev_put) when the seq
> next callback or dev seq stop is called.
> 
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

The tradeoff here is that now you are doing two unnecessary atomic
operations per stats dump.

That is what the RCU lock allows us to avoid.
Eric Dumazet April 10, 2018, 8:35 p.m. UTC | #2
On 04/10/2018 10:16 AM, David Miller wrote:
> 
> The tradeoff here is that now you are doing two unnecessary atomic
> operations per stats dump.
> 
> That is what the RCU lock allows us to avoid.
> 

dev_hold() and dev_put() are actually per-cpu increments and decrements,
pretty cheap these days.
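
For reference, they are roughly the following these days (paraphrased
from include/linux/netdevice.h):

static inline void dev_hold(struct net_device *dev)
{
	this_cpu_inc(*dev->pcpu_refcnt);	/* per-cpu, no shared atomic RMW */
}

static inline void dev_put(struct net_device *dev)
{
	this_cpu_dec(*dev->pcpu_refcnt);
}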

The problem here is that any preemption of the process holding the device
reference might trigger warnings in device unregister.
Saeed Mahameed April 11, 2018, 6:59 p.m. UTC | #3
On Tue, 2018-04-10 at 13:35 -0700, Eric Dumazet wrote:
> 
> On 04/10/2018 10:16 AM, David Miller wrote:
> > 
> > The tradeoff here is that now you are doing two unnecessary atomic
> > operations per stats dump.
> > 
> > That is what the RCU lock allows us to avoid.
> > 
> 
> dev_hold() and dev_put() are actually per-cpu increments and decrements,
> pretty cheap these days.
> 

Yes, I am adding only 2 CPU instructions here.
I think the trade-off here is too small, and the price to finally have
get_stats64 called from non-atomic context is really worth it.

It looks really odd to me that the device chain locks are held for
such long periods while we already have the means to avoid this; the
same goes for rtnl_lock. The same trick can work here for many use
cases and many ndos, we are just overprotective for no reason.


> The problem here is that any preemption of the process holding the
> device reference might trigger warnings in device unregister.
> 

This is true for any other place dev_hold is used;
as explained in the commit message, dev_hold is used for a very brief
moment before calling the stats ndo and released directly after.

Looking at:

netdev_wait_allrefs(...)
[...]
	msleep(250);

	refcnt = netdev_refcnt_read(dev);

	if (time_after(jiffies, warning_time + 10 * HZ)) {
		pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
			 dev->name, refcnt);
		warning_time = jiffies;
	}

The holder will get enough time to release the netdev way before the
warning is triggered.

The warning is triggered only if someone holds the dev for more than 10
seconds, and it is impossible for the stats ndo to take more than that;
in fact, I just did a quick measurement, and on average the
get_stats64 ndo takes 0.5us!
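
As a side note, a minimal sketch of how such a timing measurement could
be taken (hypothetical; not necessarily how it was actually measured):

	struct rtnl_link_stats64 storage;
	u64 start;

	start = ktime_get_ns();
	dev_get_stats(dev, &storage);
	pr_info("%s: get_stats64 took %llu ns\n", dev->name,
		ktime_get_ns() - start);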
Eric Dumazet April 11, 2018, 10:30 p.m. UTC | #4
On 04/11/2018 11:59 AM, Saeed Mahameed wrote:
> On Tue, 2018-04-10 at 13:35 -0700, Eric Dumazet wrote:
>>
>> On 04/10/2018 10:16 AM, David Miller wrote:
>>>
>>> The tradeoff here is that now you are doing two unnecessary atomic
>>> operations per stats dump.
>>>
>>> That is what the RCU lock allows us to avoid.
>>>
>>
>> dev_hold() and dev_put() are actually per-cpu increments and
>> decrements, pretty cheap these days.
>>
> 
> Yes, I am adding only 2 CPU instructions here.
> I think the trade-off here is too small, and the price to finally have
> get_stats64 called from non-atomic context is really worth it.

Oh... but you have not mentioned this reason in your changelog.

What about bonding stats?

By sending partial patches like that, others have to take care of the
details, and this is not really acceptable.

> 
> It looks really odd to me that the device chain locks are held for
> such long periods while we already have the means to avoid this; the
> same goes for rtnl_lock. The same trick can work here for many use
> cases and many ndos, we are just overprotective for no reason.
> 
> 
>> The problem here is that any preemption of the process holding the
>> device reference might trigger warnings in device unregister.
>>
> 
> This is true for any other place dev_hold is used;
> as explained in the commit message, dev_hold is used for a very brief
> moment before calling the stats ndo and released directly after.

Not really.

Other places usually have notifiers to remove the refcount when needed.

We worked quite hard in the past to remove extra dev_hold()
(before we finally converted it to a percpu counter).

> 
> Looking at:
> 
> netdev_wait_allrefs(...)
> [...]
> 	msleep(250);
> 
> 	refcnt = netdev_refcnt_read(dev);
> 
> 	if (time_after(jiffies, warning_time + 10 * HZ)) {
> 		pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
> 			 dev->name, refcnt);
> 		warning_time = jiffies;
> 	}
> 
> The holder will get enough time to release the netdev way before the
> warning is triggered.
> 
> The warning is triggered only if someone holds the dev for more than 10
> seconds, and it is impossible for the stats ndo to take more than that;
> in fact, I just did a quick measurement, and on average the
> get_stats64 ndo takes 0.5us!

Average is nice, but what about max time?

Sleeping more than 10 seconds to satisfy a GFP_KERNEL memory allocation can
definitely happen in the real world, or simply if you get preempted by some
RT/high-prio tasks.

Just say no to 'might sleep in ndo_get_stats()', and you will avoid a lot of issues.
Saeed Mahameed April 11, 2018, 11:47 p.m. UTC | #5
On Wed, 2018-04-11 at 15:30 -0700, Eric Dumazet wrote:
> 
> On 04/11/2018 11:59 AM, Saeed Mahameed wrote:
> > On Tue, 2018-04-10 at 13:35 -0700, Eric Dumazet wrote:
> > > 
> > > On 04/10/2018 10:16 AM, David Miller wrote:
> > > > 
> > > The tradeoff here is that now you are doing two unnecessary atomic
> > > operations per stats dump.
> > > > 
> > > > That is what the RCU lock allows us to avoid.
> > > > 
> > > 
> > > dev_hold() and dev_put() are actually per-cpu increments and
> > > decrements, pretty cheap these days.
> > > 
> > 
> > Yes, I am adding only 2 CPU instructions here.
> > I think the trade-off here is too small, and the price to finally
> > have get_stats64 called from non-atomic context is really worth it.
> 
> Oh... but you have not mentioned this reason in your changelog.
> 

From the commit message:

"This is really greedy and demanding from device drivers, since
ndo_get_stats64 is called from dev_seq_show while the RCU lock is held."

Sorry if this wasn't clear enough; I will fix this if it goes through.

> What about bonding stats?
> 
> By sending partial patches like that, others have to take care of the
> details, and this is not really acceptable.
> 

This is an RFC just to show the approach. If the approach is acceptable,
of course I will provide a full series that handles all the other
places; the change should be the same. I already have 2 other patches
to address ovs and netlink stats, I just didn't want to waste your time
on small details like netlink messages.

> > 
> > It looks really odd to me that the device chain locks are held for
> > such long periods while we already have the means to avoid this; the
> > same goes for rtnl_lock. The same trick can work here for many use
> > cases and many ndos, we are just overprotective for no reason.
> > 
> > 
> > > The problem here is that any preemption of the process holding the
> > > device reference might trigger warnings in device unregister.
> > > 
> > 
> > This is true for any other place dev_hold is used;
> > as explained in the commit message, dev_hold is used for a very brief
> > moment before calling the stats ndo and released directly after.
> 
> Not really.
> 
> Other places usually have notifiers to remove the refcount when needed.
> 

Other places hold the netdev for the whole lifetime of the netdev; they
don't know when to release it, which is why notifiers are needed there.

In this patch the approach is:

hold
call ndo
put

A notifier is not needed in this case.
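
A minimal sketch of that pattern (next_dev_in_bucket() is a hypothetical
stand-in for the real bucket lookup):

	struct rtnl_link_stats64 storage;
	struct net_device *dev;

	rcu_read_lock();
	dev = next_dev_in_bucket(seq, pos);	/* find the next candidate */
	if (dev)
		dev_hold(dev);			/* pin it for a brief window */
	rcu_read_unlock();

	if (dev) {
		dev_get_stats(dev, &storage);	/* non-atomic context now */
		dev_put(dev);
	}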

> We worked quite hard in the past to remove extra dev_hold()
> (before we finally converted it to a percpu counter).
> 
> > 
> > Looking at:
> > 
> > netdev_wait_allrefs(...)
> > [...]
> > 	msleep(250);
> > 
> > 	refcnt = netdev_refcnt_read(dev);
> > 
> > 	if (time_after(jiffies, warning_time + 10 * HZ)) {
> > 		pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
> > 			 dev->name, refcnt);
> > 		warning_time = jiffies;
> > 	}
> > 
> > The holder will get enough time to release the netdev way before the
> > warning is triggered.
> > 
> > The warning is triggered only if someone holds the dev for more than
> > 10 seconds, and it is impossible for the stats ndo to take more than
> > that; in fact, I just did a quick measurement, and on average the
> > get_stats64 ndo takes 0.5us!
> 
> Average is nice, but what about max time?
> 

Well, if we allow devices to access HW counters via FW command
interfaces in ndo_get_stats, then, testing mlx5, where we query up to 5
HW registers, it could take 100us. That is still way smaller than 10sec
:) and it is really a nice rate for fetching HW stats on demand.

> Sleeping more than 10 seconds to satisfy a GFP_KERNEL memory allocation
> can definitely happen in the real world, or simply if you get preempted
> by some RT/high-prio tasks.
> 

The same issue can occur without this patch if an IRQ is triggered
while under the RCU lock.
And an RT/high-prio task shouldn't run for more than 10sec.

If a GFP_KERNEL memory allocation takes more than 10sec, you will
already be getting tons of OOM warnings.
A netdev warning, in the case where the netdev really is being
unregistered and ndo_get_stats was preempted for more than 10 seconds,
will be the least of your problems.

> Just say no to 'might sleep in ndo_get_stats()', and you will avoid a
> lot of issues.

We are just being overprotective here.

Saying no to 'might sleep in ndo_get_stats()' is by itself creating a
lot of issues: many drivers out there have a background thread running
N times a second just to cache HW stats, which is really silly and
CPU-consuming for no reason.

The rate of ndo_get_stats should match the rate at which the driver can
actually provide fresh stats; any background thread is just a waste
of CPU. Counters should be fetched from HW on demand rather than
periodically for no reason.
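
For illustration, the polling pattern being criticized looks roughly
like this (hypothetical driver code; mydrv_query_fw_counters() is an
assumed helper that sleeps on a FW command):

static void mydrv_update_stats_work(struct work_struct *work)
{
	struct mydrv_priv *priv = container_of(work, struct mydrv_priv,
					       update_stats_work.work);

	mydrv_query_fw_counters(priv);	/* sleeps on a FW command */
	/* re-arm: poll HW counters 5 times a second, needed or not */
	queue_delayed_work(priv->wq, &priv->update_stats_work, HZ / 5);
}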

The same goes for set_rx_mode ndo.

Bottom line, it looks like the RCU locking today is only meant to
synchronize access to the netdev ndos with the device chain in
question (namespace/ovs/bonding/etc.).

There are many places where dev_get_stats is called from non-atomic
context, where the caller knows it is safe to access the netdev ndo.

Example:
net/bonding/bond_main.c: bond_enslave(...)

/* initialize slave stats */
dev_get_stats(new_slave->dev, &new_slave->slave_stats);
Eric Dumazet April 12, 2018, 2:59 a.m. UTC | #6
On 04/11/2018 04:47 PM, Saeed Mahameed wrote:
> 
> Well, if we allow devices to access HW counters via FW command
> interfaces in ndo_get_stats, then, testing mlx5, where we query up to 5
> HW registers, it could take 100us. That is still way smaller than 10sec
> :) and it is really a nice rate for fetching HW stats on demand.

If hardware stats are slower than software ones, maybe it is time to use
software stats instead of changing the whole stack?

There are very few device drivers with issues like that.
Saeed Mahameed April 12, 2018, 7:12 p.m. UTC | #7
On Wed, 2018-04-11 at 19:59 -0700, Eric Dumazet wrote:
> 
> On 04/11/2018 04:47 PM, Saeed Mahameed wrote:
> > 
> > Well, if we allow devices to access HW counters via FW command
> > interfaces in ndo_get_stats, then, testing mlx5, where we query up to
> > 5 HW registers, it could take 100us. That is still way smaller than
> > 10sec :) and it is really a nice rate for fetching HW stats on demand.
> 
> If hardware stats are slower than software ones, maybe it is time to
> use software stats instead of changing the whole stack?
> 

We already have SW stats for [rx/tx]_[packets/bytes], but
[rx/tx]_[error/drop] etc. can only be grabbed from HW.

We don't want to report only partial counters from the get_stats ndo
just to avoid sleeping.
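
For example, with a sleepable ndo a driver could fill everything in one
place; a hypothetical sketch (the mydrv_* names are assumptions):

static void mydrv_get_stats64(struct net_device *dev,
			      struct rtnl_link_stats64 *stats)
{
	struct mydrv_priv *priv = netdev_priv(dev);

	/* SW counters: cheap and always available */
	stats->rx_packets = priv->sw_stats.rx_packets;
	stats->tx_packets = priv->sw_stats.tx_packets;
	stats->rx_bytes   = priv->sw_stats.rx_bytes;
	stats->tx_bytes   = priv->sw_stats.tx_bytes;

	/* HW-only counters: a FW command that may sleep */
	mydrv_query_fw_counters(priv);
	stats->rx_errors  = priv->hw_stats.rx_errors;
	stats->rx_dropped = priv->hw_stats.rx_dropped;
}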

> There are very few device drivers with issues like that.
>
Saeed Mahameed April 16, 2018, 8:50 p.m. UTC | #8
On Thu, 2018-04-12 at 19:12 +0000, Saeed Mahameed wrote:
> On Wed, 2018-04-11 at 19:59 -0700, Eric Dumazet wrote:
> > 
> > On 04/11/2018 04:47 PM, Saeed Mahameed wrote:
> > > 
> > > Well, if we allow devices to access HW counters via FW command
> > > interfaces in ndo_get_stats, then, testing mlx5, where we query up
> > > to 5 HW registers, it could take 100us. That is still way smaller
> > > than 10sec :) and it is really a nice rate for fetching HW stats on
> > > demand.
> > 
> > If hardware stats are slower than software ones, maybe it is time to
> > use software stats instead of changing the whole stack?
> > 
> 
> We already have SW stats for [rx/tx]_[packets/bytes], but
> [rx/tx]_[error/drop] etc. can only be grabbed from HW.
> 
> We don't want to report only partial counters from the get_stats ndo
> just to avoid sleeping.
> 
> > There are very few device drivers with issues like that.
> > 

Dave, Eric, I would like to know whether it is OK to have this change
in the kernel (making the get_stats ndo "might sleep")? I really didn't
get any convincing feedback against this change.

For smart NICs, where a lot of statistics computation and
driver/hardware communication goes through firmware, there is not much
we can do in the driver to provide accurate and "fresh" statistics
other than talking directly to firmware through command interfaces that
"might sleep". Currently this is done through a background thread
caching hardware statistics 5 times a second, which is wasteful.

In order to complete this work I need to take care of netlink, ovs, and
bonding (similar to what I did in this patch).

Plus, I might need to introduce a new dev hold/put API that guarantees
the netdev is not uninitialized while it is being briefly held; this is
needed since the current device chain locks guarantee that by design.
(Should be a relatively small change.)
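
A hypothetical sketch of what such an API could look like (illustrative
only; this helper does not exist in the kernel today):

/* Take a reference only while the device is still registered; the
 * caller must be inside an rcu_read_lock() section.
 */
static inline bool dev_tryhold(struct net_device *dev)
{
	if (dev->reg_state != NETREG_REGISTERED)
		return false;
	dev_hold(dev);
	return true;
}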

Go/No Go?

Thanks,
Saeed.
Eric Dumazet April 16, 2018, 9:07 p.m. UTC | #9
On 04/16/2018 01:50 PM, Saeed Mahameed wrote:
> 
> Dave, Eric, I would like to know whether it is OK to have this change
> in the kernel (making the get_stats ndo "might sleep")? I really didn't
> get any convincing feedback against this change.

The fact that you were not convinced does not mean you were right.

I would rather not change the current behavior and risk nasty bugs.

Having SNMP daemons blocked while some device is sleeping in its ndo_get_stats()
is concerning, since it is currently an immediate operation.

ethtool -S can definitely sleep if you need fancy stats.
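
For context, ethtool stats callbacks already run in process context
under RTNL, so a sleeping FW query is acceptable there; a hypothetical
sketch (the mydrv_* names are assumptions):

static void mydrv_get_ethtool_stats(struct net_device *dev,
				    struct ethtool_stats *stats, u64 *data)
{
	struct mydrv_priv *priv = netdev_priv(dev);

	mydrv_query_fw_counters(priv);	/* may sleep: FW command */
	memcpy(data, priv->hw_counters, sizeof(priv->hw_counters));
}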

Patch

diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
index 9737302907b1..9d5ce6a203d2 100644
--- a/net/core/net-procfs.c
+++ b/net/core/net-procfs.c
@@ -31,19 +31,24 @@  static inline struct net_device *dev_from_same_bucket(struct seq_file *seq, loff
 
 static inline struct net_device *dev_from_bucket(struct seq_file *seq, loff_t *pos)
 {
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 	unsigned int bucket;
 
+	rcu_read_lock();
 	do {
 		dev = dev_from_same_bucket(seq, pos);
-		if (dev)
-			return dev;
+		if (dev) {
+			dev_hold(dev);
+			goto unlock;
+		}
 
 		bucket = get_bucket(*pos) + 1;
 		*pos = set_bucket_offset(bucket, 1);
 	} while (bucket < NETDEV_HASHENTRIES);
 
-	return NULL;
+unlock:
+	rcu_read_unlock();
+	return dev;
 }
 
 /*
@@ -51,9 +56,7 @@  static inline struct net_device *dev_from_bucket(struct seq_file *seq, loff_t *p
  *	in detail.
  */
 static void *dev_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(RCU)
 {
-	rcu_read_lock();
 	if (!*pos)
 		return SEQ_START_TOKEN;
 
@@ -66,13 +69,15 @@  static void *dev_seq_start(struct seq_file *seq, loff_t *pos)
 static void *dev_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
 	++*pos;
+	if (v && v != SEQ_START_TOKEN)
+		dev_put(v);
 	return dev_from_bucket(seq, pos);
 }
 
 static void dev_seq_stop(struct seq_file *seq, void *v)
-	__releases(RCU)
 {
-	rcu_read_unlock();
+	if (v && v != SEQ_START_TOKEN)
+		dev_put(v);
 }
 
 static void dev_seq_printf_stats(struct seq_file *seq, struct net_device *dev)