Patchwork dropmon: add ability to detect when hardware dropsrxpackets

login
register
mail settings
Submitter Neil Horman
Date May 14, 2009, 5:29 p.m.
Message ID <20090514172954.GA3867@hmsreliant.think-freely.org>
Download mbox | patch
Permalink /patch/27217/
State Superseded
Delegated to: David Miller
Headers show

Comments

Neil Horman - May 14, 2009, 5:29 p.m.
On Thu, May 14, 2009 at 02:44:08PM +0200, Jiri Pirko wrote:
> >+
> >+		/*
> >+		 * Clean the device list
> >+		 */
> >+		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
> 		^^^^^^^^^^^^^^^^^^^^^^^
> This is meaningless here. Use list_for_each_entry_rcu only under rcu_read_lock.
> Also it would be good to use list_for_each_entry_safe here since you're
> modifying the list.
> 

The definition of list_for_each_entry_rcu specifically says its safe against
list-mutation primitives, so its fine.  Although you are correct, in that its
safety is dependent on the protection of rcu_read_lock(), so I'll add that in.
Thanks for the catch!  New patch attached

Change notes:
1) Add rcu_read_lock/unlock protection around TRACE_OFF event

Neil


Patch to add the ability to detect drops in hardware interfaces via dropwatch.
Adds a tracepoint to net_rx_action to signal everytime a napi instance is
polled.  The dropmon code then periodically checks to see if the rx_frames
counter has changed, and if so, adds a drop notification to the netlink
protocol, using the reserved all-0's vector to indicate the drop location was in
hardware, rather than somewhere in the code.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


 include/linux/net_dropmon.h |    7 ++
 include/trace/napi.h        |   11 ++++
 net/core/dev.c              |    5 +
 net/core/drop_monitor.c     |  117 ++++++++++++++++++++++++++++++++++++++++++--
 net/core/net-traces.c       |    4 +
 net/core/netpoll.c          |    2 
 6 files changed, 141 insertions(+), 5 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - May 15, 2009, 5:49 a.m.
On 14-05-2009 19:29, Neil Horman wrote:
> On Thu, May 14, 2009 at 02:44:08PM +0200, Jiri Pirko wrote:
>>> +
>>> +		/*
>>> +		 * Clean the device list
>>> +		 */
>>> +		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
>> 		^^^^^^^^^^^^^^^^^^^^^^^
>> This is meaningless here. Use list_for_each_entry_rcu only under rcu_read_lock.
>> Also it would be good to use list_for_each_entry_safe here since you're
>> modifying the list.
>>
> 
> The definition of list_for_each_entry_rcu specifically says its safe against
> list-mutation primitives, so its fine.  Although you are correct, in that its
> safety is dependent on the protection of rcu_read_lock(), so I'll add that in.
> Thanks for the catch!  New patch attached
> 
> Change notes:
> 1) Add rcu_read_lock/unlock protection around TRACE_OFF event
> 
> Neil
...
>  static int set_all_monitor_traces(int state)
>  {
>  	int rc = 0;
> +	struct dm_hw_stat_delta *new_stat = NULL;
> +
> +	spin_lock(&trace_state_lock);
>  
>  	switch (state) {
>  	case TRACE_ON:
>  		rc |= register_trace_kfree_skb(trace_kfree_skb_hit);
> +		rc |= register_trace_napi_poll(trace_napi_poll_hit);
>  		break;
>  	case TRACE_OFF:
>  		rc |= unregister_trace_kfree_skb(trace_kfree_skb_hit);
> +		rc |= unregister_trace_napi_poll(trace_napi_poll_hit);
>  
>  		tracepoint_synchronize_unregister();
> +
> +		/*
> +		 * Clean the device list
> +		 */
> +		rcu_read_lock();
> +		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
> +			if (new_stat->dev == NULL) {
> +				list_del_rcu(&new_stat->list);
> +				call_rcu(&new_stat->rcu, free_dm_hw_stat);
> +			}
> +		}
> +		rcu_read_unlock();

IMHO it looks worse now. rcu_read_lock() suggests it's a read side,
and spin_lock(&trace_state_lock) protects something else.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiri Pirko - May 15, 2009, 6:51 a.m.
Thu, May 14, 2009 at 07:29:54PM CEST, nhorman@tuxdriver.com wrote:
>On Thu, May 14, 2009 at 02:44:08PM +0200, Jiri Pirko wrote:
>> >+
>> >+		/*
>> >+		 * Clean the device list
>> >+		 */
>> >+		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
>> 		^^^^^^^^^^^^^^^^^^^^^^^
>> This is meaningless here. Use list_for_each_entry_rcu only under rcu_read_lock.
>> Also it would be good to use list_for_each_entry_safe here since you're
>> modifying the list.
>> 
>
>The definition of list_for_each_entry_rcu specifically says its safe against
>list-mutation primitives, so its fine.  Although you are correct, in that its
>safety is dependent on the protection of rcu_read_lock(), so I'll add that in.

You are right that list_for_each_entry_rcu is safe against list-mutation
primitives. But there's no need for this on the update side when you hold a writer
spinlock. Here I think it's better (and also less confusing) to use ordinary
list_for_each_entry which in this case must be list_for_each_entry_safe.

		Jirka

>Thanks for the catch!  New patch attached
>
>Change notes:
>1) Add rcu_read_lock/unlock protection around TRACE_OFF event
>
>Neil
>
>
>Patch to add the ability to detect drops in hardware interfaces via dropwatch.
>Adds a tracepoint to net_rx_action to signal everytime a napi instance is
>polled.  The dropmon code then periodically checks to see if the rx_frames
>counter has changed, and if so, adds a drop notification to the netlink
>protocol, using the reserved all-0's vector to indicate the drop location was in
>hardware, rather than somewhere in the code.
>
>Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>
>
> include/linux/net_dropmon.h |    7 ++
> include/trace/napi.h        |   11 ++++
> net/core/dev.c              |    5 +
> net/core/drop_monitor.c     |  117 ++++++++++++++++++++++++++++++++++++++++++--
> net/core/net-traces.c       |    4 +
> net/core/netpoll.c          |    2 
> 6 files changed, 141 insertions(+), 5 deletions(-)
>
>diff --git a/include/linux/net_dropmon.h b/include/linux/net_dropmon.h
>index 0217fb8..9addc62 100644
>--- a/include/linux/net_dropmon.h
>+++ b/include/linux/net_dropmon.h
>@@ -8,6 +8,13 @@ struct net_dm_drop_point {
> 	__u32 count;
> };
> 
>+#define is_drop_point_hw(x) do {\
>+	int ____i, ____j;\
>+	for (____i = 0; ____i < 8; i ____i++)\
>+		____j |= x[____i];\
>+	____j;\
>+} while (0)
>+
> #define NET_DM_CFG_VERSION  0
> #define NET_DM_CFG_ALERT_COUNT  1
> #define NET_DM_CFG_ALERT_DELAY 2
>diff --git a/include/trace/napi.h b/include/trace/napi.h
>new file mode 100644
>index 0000000..7c39eb7
>--- /dev/null
>+++ b/include/trace/napi.h
>@@ -0,0 +1,11 @@
>+#ifndef _TRACE_NAPI_H_
>+#define _TRACE_NAPI_H_
>+
>+#include <linux/skbuff.h>
>+#include <linux/tracepoint.h>
>+
>+DECLARE_TRACE(napi_poll,
>+	TP_PROTO(struct napi_struct *napi),
>+	TP_ARGS(napi));
>+
>+#endif
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 637ea71..165979c 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -126,6 +126,7 @@
> #include <linux/in.h>
> #include <linux/jhash.h>
> #include <linux/random.h>
>+#include <trace/napi.h>
> 
> #include "net-sysfs.h"
> 
>@@ -2763,8 +2764,10 @@ static void net_rx_action(struct softirq_action *h)
> 		 * accidently calling ->poll() when NAPI is not scheduled.
> 		 */
> 		work = 0;
>-		if (test_bit(NAPI_STATE_SCHED, &n->state))
>+		if (test_bit(NAPI_STATE_SCHED, &n->state)) {
> 			work = n->poll(n, weight);
>+			trace_napi_poll(n);
>+		}
> 
> 		WARN_ON_ONCE(work > weight);
> 
>diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
>index 2797b71..f9130d5 100644
>--- a/net/core/drop_monitor.c
>+++ b/net/core/drop_monitor.c
>@@ -22,8 +22,10 @@
> #include <linux/timer.h>
> #include <linux/bitops.h>
> #include <net/genetlink.h>
>+#include <net/netevent.h>
> 
> #include <trace/skb.h>
>+#include <trace/napi.h>
> 
> #include <asm/unaligned.h>
> 
>@@ -38,7 +40,8 @@ static void send_dm_alert(struct work_struct *unused);
>  * and the work handle that will send up
>  * netlink alerts
>  */
>-struct sock *dm_sock;
>+static int trace_state = TRACE_OFF;
>+static spinlock_t trace_state_lock = SPIN_LOCK_UNLOCKED;
> 
> struct per_cpu_dm_data {
> 	struct work_struct dm_alert_work;
>@@ -47,6 +50,13 @@ struct per_cpu_dm_data {
> 	struct timer_list send_timer;
> };
> 
>+struct dm_hw_stat_delta {
>+	struct net_device *dev;
>+	struct list_head list;
>+	struct rcu_head rcu;
>+	unsigned long last_drop_val;
>+};
>+
> static struct genl_family net_drop_monitor_family = {
> 	.id             = GENL_ID_GENERATE,
> 	.hdrsize        = 0,
>@@ -59,7 +69,8 @@ static DEFINE_PER_CPU(struct per_cpu_dm_data, dm_cpu_data);
> 
> static int dm_hit_limit = 64;
> static int dm_delay = 1;
>-
>+static unsigned long dm_hw_check_delta = 2*HZ;
>+static LIST_HEAD(hw_stats_list);
> 
> static void reset_per_cpu_data(struct per_cpu_dm_data *data)
> {
>@@ -115,7 +126,7 @@ static void sched_send_work(unsigned long unused)
> 	schedule_work(&data->dm_alert_work);
> }
> 
>-static void trace_kfree_skb_hit(struct sk_buff *skb, void *location)
>+static void trace_drop_common(struct sk_buff *skb, void *location)
> {
> 	struct net_dm_alert_msg *msg;
> 	struct nlmsghdr *nlh;
>@@ -159,26 +170,81 @@ out:
> 	return;
> }
> 
>+static void trace_kfree_skb_hit(struct sk_buff *skb, void *location)
>+{
>+	trace_drop_common(skb, location);
>+}
>+
>+static void trace_napi_poll_hit(struct napi_struct *napi)
>+{
>+	struct dm_hw_stat_delta *new_stat;
>+
>+	/*
>+	 * Ratelimit our check time to dm_hw_check_delta jiffies
>+	 */
>+	if (!time_after(jiffies, napi->dev->last_rx + dm_hw_check_delta))
>+		return;
>+
>+	rcu_read_lock();
>+	list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
>+		if ((new_stat->dev == napi->dev)  &&
>+		    (napi->dev->stats.rx_dropped != new_stat->last_drop_val)) {
>+			trace_drop_common(NULL, NULL);
>+			new_stat->last_drop_val = napi->dev->stats.rx_dropped;
>+			break;
>+		}
>+	}
>+	rcu_read_unlock();
>+}
>+
>+
>+static void free_dm_hw_stat(struct rcu_head *head)
>+{
>+	struct dm_hw_stat_delta *n;
>+	n = container_of(head, struct dm_hw_stat_delta, rcu);
>+	kfree(n);
>+}
>+
> static int set_all_monitor_traces(int state)
> {
> 	int rc = 0;
>+	struct dm_hw_stat_delta *new_stat = NULL;
>+
>+	spin_lock(&trace_state_lock);
> 
> 	switch (state) {
> 	case TRACE_ON:
> 		rc |= register_trace_kfree_skb(trace_kfree_skb_hit);
>+		rc |= register_trace_napi_poll(trace_napi_poll_hit);
> 		break;
> 	case TRACE_OFF:
> 		rc |= unregister_trace_kfree_skb(trace_kfree_skb_hit);
>+		rc |= unregister_trace_napi_poll(trace_napi_poll_hit);
> 
> 		tracepoint_synchronize_unregister();
>+
>+		/*
>+		 * Clean the device list
>+		 */
>+		rcu_read_lock();
>+		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
>+			if (new_stat->dev == NULL) {
>+				list_del_rcu(&new_stat->list);
>+				call_rcu(&new_stat->rcu, free_dm_hw_stat);
>+			}
>+		}
>+		rcu_read_unlock();
> 		break;
> 	default:
> 		rc = 1;
> 		break;
> 	}
> 
>+	spin_unlock(&trace_state_lock);
>+
> 	if (rc)
> 		return -EINPROGRESS;
>+	trace_state = state;
> 	return rc;
> }
> 
>@@ -204,6 +270,38 @@ static int net_dm_cmd_trace(struct sk_buff *skb,
> 	return -ENOTSUPP;
> }
> 
>+static int dropmon_net_event(struct notifier_block *ev_block,
>+			unsigned long event, void *ptr)
>+{
>+	struct net_device *dev = ptr;
>+	struct dm_hw_stat_delta *new_stat = NULL;
>+
>+	switch (event) {
>+	case NETDEV_REGISTER:
>+		new_stat = kzalloc(sizeof(struct dm_hw_stat_delta), GFP_KERNEL);
>+
>+		if (!new_stat)
>+			goto out;
>+
>+		new_stat->dev = dev;
>+		INIT_RCU_HEAD(&new_stat->rcu);
>+		spin_lock(&trace_state_lock);
>+		list_add_rcu(&new_stat->list, &hw_stats_list);
>+		spin_unlock(&trace_state_lock);
>+		break;
>+	case NETDEV_UNREGISTER:
>+		rcu_read_lock();
>+		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
>+			if (new_stat->dev == dev)
>+				new_stat->dev = NULL;
>+				break;
>+		}
>+		rcu_read_unlock();
>+		break;
>+	}
>+out:
>+	return NOTIFY_DONE;
>+}
> 
> static struct genl_ops dropmon_ops[] = {
> 	{
>@@ -220,6 +318,10 @@ static struct genl_ops dropmon_ops[] = {
> 	},
> };
> 
>+static struct notifier_block dropmon_net_notifier = {
>+	.notifier_call = dropmon_net_event
>+};
>+
> static int __init init_net_drop_monitor(void)
> {
> 	int cpu;
>@@ -243,12 +345,18 @@ static int __init init_net_drop_monitor(void)
> 		ret = genl_register_ops(&net_drop_monitor_family,
> 					&dropmon_ops[i]);
> 		if (ret) {
>-			printk(KERN_CRIT "failed to register operation %d\n",
>+			printk(KERN_CRIT "Failed to register operation %d\n",
> 				dropmon_ops[i].cmd);
> 			goto out_unreg;
> 		}
> 	}
> 
>+	rc = register_netdevice_notifier(&dropmon_net_notifier);
>+	if (rc < 0) {
>+		printk(KERN_CRIT "Failed to register netdevice notifier\n");
>+		goto out_unreg;
>+	}
>+
> 	rc = 0;
> 
> 	for_each_present_cpu(cpu) {
>@@ -259,6 +367,7 @@ static int __init init_net_drop_monitor(void)
> 		data->send_timer.data = cpu;
> 		data->send_timer.function = sched_send_work;
> 	}
>+
> 	goto out;
> 
> out_unreg:
>diff --git a/net/core/net-traces.c b/net/core/net-traces.c
>index c8fb456..b07b25b 100644
>--- a/net/core/net-traces.c
>+++ b/net/core/net-traces.c
>@@ -20,6 +20,7 @@
> #include <linux/netlink.h>
> #include <linux/net_dropmon.h>
> #include <trace/skb.h>
>+#include <trace/napi.h>
> 
> #include <asm/unaligned.h>
> #include <asm/bitops.h>
>@@ -27,3 +28,6 @@
> 
> DEFINE_TRACE(kfree_skb);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);
>+
>+DEFINE_TRACE(napi_poll);
>+EXPORT_TRACEPOINT_SYMBOL_GPL(napi_poll);
>diff --git a/net/core/netpoll.c b/net/core/netpoll.c
>index b5873bd..446d3ba 100644
>--- a/net/core/netpoll.c
>+++ b/net/core/netpoll.c
>@@ -24,6 +24,7 @@
> #include <net/tcp.h>
> #include <net/udp.h>
> #include <asm/unaligned.h>
>+#include <trace/napi.h>
> 
> /*
>  * We maintain a small pool of fully-sized skbs, to make sure the
>@@ -137,6 +138,7 @@ static int poll_one_napi(struct netpoll_info *npinfo,
> 	set_bit(NAPI_STATE_NPSVC, &napi->state);
> 
> 	work = napi->poll(napi, budget);
>+	trace_napi_poll(napi->dev);
> 
> 	clear_bit(NAPI_STATE_NPSVC, &napi->state);
> 	atomic_dec(&trapped);
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - May 15, 2009, 7:37 a.m.
Jiri Pirko a écrit :
> Thu, May 14, 2009 at 07:29:54PM CEST, nhorman@tuxdriver.com wrote:
>> On Thu, May 14, 2009 at 02:44:08PM +0200, Jiri Pirko wrote:
>>>> +
>>>> +		/*
>>>> +		 * Clean the device list
>>>> +		 */
>>>> +		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
>>> 		^^^^^^^^^^^^^^^^^^^^^^^
>>> This is meaningless here. Use list_for_each_entry_rcu only under rcu_read_lock.
>>> Also it would be good to use list_for_each_entry_safe here since you're
>>> modifying the list.
>>>
>> The definition of list_for_each_entry_rcu specifically says its safe against
>> list-mutation primitives, so its fine.  Although you are correct, in that its
>> safety is dependent on the protection of rcu_read_lock(), so I'll add that in.
> 
> You are right that list_for_each_entry_rcu is safe against list-mutation
> primitives. But there's no need for this on the update side when you hold a writer
> spinlock. Here I think it's better (and also less confusing) to use ordinary
> list_for_each_entry which in this case must be list_for_each_entry_safe.

Absolutely.

RCU is tricky, and must be well understood. Using the right verbs is really needed
to help everybody read the code and be able to understand it quickly and maintain it
if needed.

In this particular case, the list_for_each_entry_rcu() is a litle bit more
expensive than regular list_for_each_entry(), as it includes additionnal barriers.

Just reading code like this :

+	spin_lock(&trace_state_lock);
+		rcu_read_lock();
+		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
+			if (new_stat->dev == NULL) {
+				list_del_rcu(&new_stat->list);
+				call_rcu(&new_stat->rcu, free_dm_hw_stat);
+			}
+		}
+		rcu_read_unlock();
+	spin_unlock(&trace_state_lock);

We *know* something is wrong, as rcu_read_lock() is supposed to guard a readside,
so having both rcu_read_lock() and list_del_rcu() is certainly wrong, we dont have 
to actually understand what is really done by the algorithm.

So following is much better : (but maybe not correct... I let you 
check if list_for_each_entry_safe() would not be better here, since
you delete elements in a list while iterating in it)

Maybe you can add a break; after call_rcu() if you know only one element
can match the "if (new_stat->dev == NULL) " condition, and use normal list_for_each_entry()

+	spin_lock(&trace_state_lock);
+		list_for_each_entry(new_stat, &hw_stats_list, list) {
+			if (new_stat->dev == NULL) {
+				list_del_rcu(&new_stat->list);
+				call_rcu(&new_stat->rcu, free_dm_hw_stat);
+			}
+		}
+	spin_unlock(&trace_state_lock);

Even if the 'wrong' version was not buggy (no crashes or corruption), the 'right' one
is a commonly used construct in kernel that most dev are able to read without asking 
themselves "is is correct or not ? Dont we have something strange here ?"

Neil, you need to read more code playing with RCU to get familiar with it, we all did same
errors in the past :)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman - May 15, 2009, 10:59 a.m.
On Fri, May 15, 2009 at 08:51:02AM +0200, Jiri Pirko wrote:
> Thu, May 14, 2009 at 07:29:54PM CEST, nhorman@tuxdriver.com wrote:
> >On Thu, May 14, 2009 at 02:44:08PM +0200, Jiri Pirko wrote:
> >> >+
> >> >+		/*
> >> >+		 * Clean the device list
> >> >+		 */
> >> >+		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
> >> 		^^^^^^^^^^^^^^^^^^^^^^^
> >> This is meaningless here. Use list_for_each_entry_rcu only under rcu_read_lock.
> >> Also it would be good to use list_for_each_entry_safe here since you're
> >> modifying the list.
> >> 
> >
> >The definition of list_for_each_entry_rcu specifically says its safe against
> >list-mutation primitives, so its fine.  Although you are correct, in that its
> >safety is dependent on the protection of rcu_read_lock(), so I'll add that in.
> 
> You are right that list_for_each_entry_rcu is safe against list-mutation
> primitives. But there's no need for this on the update side when you hold a writer
> spinlock. Here I think it's better (and also less confusing) to use ordinary
> list_for_each_entry which in this case must be list_for_each_entry_safe.
> 
> 		Jirka
> 
I disagree.  I think visually its more understandable with the rcu variant, and
that spinlock isn't a writer spinlock, the other uses of this list don't use
that lock (to imporove performance).  The spinlock is there simply to serialize
changes to the trace state (so that one cpu doen't try to turn dropmon on, while
another tries to turn it off).  So we need the rcu variant to protect the list
removal against concurrent use on the reader side in the new registered trace
point function.

Neil

> >Thanks for the catch!  New patch attached
> >
> >Change notes:
> >1) Add rcu_read_lock/unlock protection around TRACE_OFF event
> >
> >Neil
> >
> >
> >Patch to add the ability to detect drops in hardware interfaces via dropwatch.
> >Adds a tracepoint to net_rx_action to signal everytime a napi instance is
> >polled.  The dropmon code then periodically checks to see if the rx_frames
> >counter has changed, and if so, adds a drop notification to the netlink
> >protocol, using the reserved all-0's vector to indicate the drop location was in
> >hardware, rather than somewhere in the code.
> >
> >Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> >
> >
> > include/linux/net_dropmon.h |    7 ++
> > include/trace/napi.h        |   11 ++++
> > net/core/dev.c              |    5 +
> > net/core/drop_monitor.c     |  117 ++++++++++++++++++++++++++++++++++++++++++--
> > net/core/net-traces.c       |    4 +
> > net/core/netpoll.c          |    2 
> > 6 files changed, 141 insertions(+), 5 deletions(-)
> >
> >diff --git a/include/linux/net_dropmon.h b/include/linux/net_dropmon.h
> >index 0217fb8..9addc62 100644
> >--- a/include/linux/net_dropmon.h
> >+++ b/include/linux/net_dropmon.h
> >@@ -8,6 +8,13 @@ struct net_dm_drop_point {
> > 	__u32 count;
> > };
> > 
> >+#define is_drop_point_hw(x) do {\
> >+	int ____i, ____j;\
> >+	for (____i = 0; ____i < 8; i ____i++)\
> >+		____j |= x[____i];\
> >+	____j;\
> >+} while (0)
> >+
> > #define NET_DM_CFG_VERSION  0
> > #define NET_DM_CFG_ALERT_COUNT  1
> > #define NET_DM_CFG_ALERT_DELAY 2
> >diff --git a/include/trace/napi.h b/include/trace/napi.h
> >new file mode 100644
> >index 0000000..7c39eb7
> >--- /dev/null
> >+++ b/include/trace/napi.h
> >@@ -0,0 +1,11 @@
> >+#ifndef _TRACE_NAPI_H_
> >+#define _TRACE_NAPI_H_
> >+
> >+#include <linux/skbuff.h>
> >+#include <linux/tracepoint.h>
> >+
> >+DECLARE_TRACE(napi_poll,
> >+	TP_PROTO(struct napi_struct *napi),
> >+	TP_ARGS(napi));
> >+
> >+#endif
> >diff --git a/net/core/dev.c b/net/core/dev.c
> >index 637ea71..165979c 100644
> >--- a/net/core/dev.c
> >+++ b/net/core/dev.c
> >@@ -126,6 +126,7 @@
> > #include <linux/in.h>
> > #include <linux/jhash.h>
> > #include <linux/random.h>
> >+#include <trace/napi.h>
> > 
> > #include "net-sysfs.h"
> > 
> >@@ -2763,8 +2764,10 @@ static void net_rx_action(struct softirq_action *h)
> > 		 * accidently calling ->poll() when NAPI is not scheduled.
> > 		 */
> > 		work = 0;
> >-		if (test_bit(NAPI_STATE_SCHED, &n->state))
> >+		if (test_bit(NAPI_STATE_SCHED, &n->state)) {
> > 			work = n->poll(n, weight);
> >+			trace_napi_poll(n);
> >+		}
> > 
> > 		WARN_ON_ONCE(work > weight);
> > 
> >diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
> >index 2797b71..f9130d5 100644
> >--- a/net/core/drop_monitor.c
> >+++ b/net/core/drop_monitor.c
> >@@ -22,8 +22,10 @@
> > #include <linux/timer.h>
> > #include <linux/bitops.h>
> > #include <net/genetlink.h>
> >+#include <net/netevent.h>
> > 
> > #include <trace/skb.h>
> >+#include <trace/napi.h>
> > 
> > #include <asm/unaligned.h>
> > 
> >@@ -38,7 +40,8 @@ static void send_dm_alert(struct work_struct *unused);
> >  * and the work handle that will send up
> >  * netlink alerts
> >  */
> >-struct sock *dm_sock;
> >+static int trace_state = TRACE_OFF;
> >+static spinlock_t trace_state_lock = SPIN_LOCK_UNLOCKED;
> > 
> > struct per_cpu_dm_data {
> > 	struct work_struct dm_alert_work;
> >@@ -47,6 +50,13 @@ struct per_cpu_dm_data {
> > 	struct timer_list send_timer;
> > };
> > 
> >+struct dm_hw_stat_delta {
> >+	struct net_device *dev;
> >+	struct list_head list;
> >+	struct rcu_head rcu;
> >+	unsigned long last_drop_val;
> >+};
> >+
> > static struct genl_family net_drop_monitor_family = {
> > 	.id             = GENL_ID_GENERATE,
> > 	.hdrsize        = 0,
> >@@ -59,7 +69,8 @@ static DEFINE_PER_CPU(struct per_cpu_dm_data, dm_cpu_data);
> > 
> > static int dm_hit_limit = 64;
> > static int dm_delay = 1;
> >-
> >+static unsigned long dm_hw_check_delta = 2*HZ;
> >+static LIST_HEAD(hw_stats_list);
> > 
> > static void reset_per_cpu_data(struct per_cpu_dm_data *data)
> > {
> >@@ -115,7 +126,7 @@ static void sched_send_work(unsigned long unused)
> > 	schedule_work(&data->dm_alert_work);
> > }
> > 
> >-static void trace_kfree_skb_hit(struct sk_buff *skb, void *location)
> >+static void trace_drop_common(struct sk_buff *skb, void *location)
> > {
> > 	struct net_dm_alert_msg *msg;
> > 	struct nlmsghdr *nlh;
> >@@ -159,26 +170,81 @@ out:
> > 	return;
> > }
> > 
> >+static void trace_kfree_skb_hit(struct sk_buff *skb, void *location)
> >+{
> >+	trace_drop_common(skb, location);
> >+}
> >+
> >+static void trace_napi_poll_hit(struct napi_struct *napi)
> >+{
> >+	struct dm_hw_stat_delta *new_stat;
> >+
> >+	/*
> >+	 * Ratelimit our check time to dm_hw_check_delta jiffies
> >+	 */
> >+	if (!time_after(jiffies, napi->dev->last_rx + dm_hw_check_delta))
> >+		return;
> >+
> >+	rcu_read_lock();
> >+	list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
> >+		if ((new_stat->dev == napi->dev)  &&
> >+		    (napi->dev->stats.rx_dropped != new_stat->last_drop_val)) {
> >+			trace_drop_common(NULL, NULL);
> >+			new_stat->last_drop_val = napi->dev->stats.rx_dropped;
> >+			break;
> >+		}
> >+	}
> >+	rcu_read_unlock();
> >+}
> >+
> >+
> >+static void free_dm_hw_stat(struct rcu_head *head)
> >+{
> >+	struct dm_hw_stat_delta *n;
> >+	n = container_of(head, struct dm_hw_stat_delta, rcu);
> >+	kfree(n);
> >+}
> >+
> > static int set_all_monitor_traces(int state)
> > {
> > 	int rc = 0;
> >+	struct dm_hw_stat_delta *new_stat = NULL;
> >+
> >+	spin_lock(&trace_state_lock);
> > 
> > 	switch (state) {
> > 	case TRACE_ON:
> > 		rc |= register_trace_kfree_skb(trace_kfree_skb_hit);
> >+		rc |= register_trace_napi_poll(trace_napi_poll_hit);
> > 		break;
> > 	case TRACE_OFF:
> > 		rc |= unregister_trace_kfree_skb(trace_kfree_skb_hit);
> >+		rc |= unregister_trace_napi_poll(trace_napi_poll_hit);
> > 
> > 		tracepoint_synchronize_unregister();
> >+
> >+		/*
> >+		 * Clean the device list
> >+		 */
> >+		rcu_read_lock();
> >+		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
> >+			if (new_stat->dev == NULL) {
> >+				list_del_rcu(&new_stat->list);
> >+				call_rcu(&new_stat->rcu, free_dm_hw_stat);
> >+			}
> >+		}
> >+		rcu_read_unlock();
> > 		break;
> > 	default:
> > 		rc = 1;
> > 		break;
> > 	}
> > 
> >+	spin_unlock(&trace_state_lock);
> >+
> > 	if (rc)
> > 		return -EINPROGRESS;
> >+	trace_state = state;
> > 	return rc;
> > }
> > 
> >@@ -204,6 +270,38 @@ static int net_dm_cmd_trace(struct sk_buff *skb,
> > 	return -ENOTSUPP;
> > }
> > 
> >+static int dropmon_net_event(struct notifier_block *ev_block,
> >+			unsigned long event, void *ptr)
> >+{
> >+	struct net_device *dev = ptr;
> >+	struct dm_hw_stat_delta *new_stat = NULL;
> >+
> >+	switch (event) {
> >+	case NETDEV_REGISTER:
> >+		new_stat = kzalloc(sizeof(struct dm_hw_stat_delta), GFP_KERNEL);
> >+
> >+		if (!new_stat)
> >+			goto out;
> >+
> >+		new_stat->dev = dev;
> >+		INIT_RCU_HEAD(&new_stat->rcu);
> >+		spin_lock(&trace_state_lock);
> >+		list_add_rcu(&new_stat->list, &hw_stats_list);
> >+		spin_unlock(&trace_state_lock);
> >+		break;
> >+	case NETDEV_UNREGISTER:
> >+		rcu_read_lock();
> >+		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
> >+			if (new_stat->dev == dev)
> >+				new_stat->dev = NULL;
> >+				break;
> >+		}
> >+		rcu_read_unlock();
> >+		break;
> >+	}
> >+out:
> >+	return NOTIFY_DONE;
> >+}
> > 
> > static struct genl_ops dropmon_ops[] = {
> > 	{
> >@@ -220,6 +318,10 @@ static struct genl_ops dropmon_ops[] = {
> > 	},
> > };
> > 
> >+static struct notifier_block dropmon_net_notifier = {
> >+	.notifier_call = dropmon_net_event
> >+};
> >+
> > static int __init init_net_drop_monitor(void)
> > {
> > 	int cpu;
> >@@ -243,12 +345,18 @@ static int __init init_net_drop_monitor(void)
> > 		ret = genl_register_ops(&net_drop_monitor_family,
> > 					&dropmon_ops[i]);
> > 		if (ret) {
> >-			printk(KERN_CRIT "failed to register operation %d\n",
> >+			printk(KERN_CRIT "Failed to register operation %d\n",
> > 				dropmon_ops[i].cmd);
> > 			goto out_unreg;
> > 		}
> > 	}
> > 
> >+	rc = register_netdevice_notifier(&dropmon_net_notifier);
> >+	if (rc < 0) {
> >+		printk(KERN_CRIT "Failed to register netdevice notifier\n");
> >+		goto out_unreg;
> >+	}
> >+
> > 	rc = 0;
> > 
> > 	for_each_present_cpu(cpu) {
> >@@ -259,6 +367,7 @@ static int __init init_net_drop_monitor(void)
> > 		data->send_timer.data = cpu;
> > 		data->send_timer.function = sched_send_work;
> > 	}
> >+
> > 	goto out;
> > 
> > out_unreg:
> >diff --git a/net/core/net-traces.c b/net/core/net-traces.c
> >index c8fb456..b07b25b 100644
> >--- a/net/core/net-traces.c
> >+++ b/net/core/net-traces.c
> >@@ -20,6 +20,7 @@
> > #include <linux/netlink.h>
> > #include <linux/net_dropmon.h>
> > #include <trace/skb.h>
> >+#include <trace/napi.h>
> > 
> > #include <asm/unaligned.h>
> > #include <asm/bitops.h>
> >@@ -27,3 +28,6 @@
> > 
> > DEFINE_TRACE(kfree_skb);
> > EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);
> >+
> >+DEFINE_TRACE(napi_poll);
> >+EXPORT_TRACEPOINT_SYMBOL_GPL(napi_poll);
> >diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> >index b5873bd..446d3ba 100644
> >--- a/net/core/netpoll.c
> >+++ b/net/core/netpoll.c
> >@@ -24,6 +24,7 @@
> > #include <net/tcp.h>
> > #include <net/udp.h>
> > #include <asm/unaligned.h>
> >+#include <trace/napi.h>
> > 
> > /*
> >  * We maintain a small pool of fully-sized skbs, to make sure the
> >@@ -137,6 +138,7 @@ static int poll_one_napi(struct netpoll_info *npinfo,
> > 	set_bit(NAPI_STATE_NPSVC, &napi->state);
> > 
> > 	work = napi->poll(napi, budget);
> >+	trace_napi_poll(napi->dev);
> > 
> > 	clear_bit(NAPI_STATE_NPSVC, &napi->state);
> > 	atomic_dec(&trapped);
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman - May 15, 2009, 11:01 a.m.
On Fri, May 15, 2009 at 05:49:47AM +0000, Jarek Poplawski wrote:
> On 14-05-2009 19:29, Neil Horman wrote:
> > On Thu, May 14, 2009 at 02:44:08PM +0200, Jiri Pirko wrote:
> >>> +
> >>> +		/*
> >>> +		 * Clean the device list
> >>> +		 */
> >>> +		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
> >> 		^^^^^^^^^^^^^^^^^^^^^^^
> >> This is meaningless here. Use list_for_each_entry_rcu only under rcu_read_lock.
> >> Also it would be good to use list_for_each_entry_safe here since you're
> >> modifying the list.
> >>
> > 
> > The definition of list_for_each_entry_rcu specifically says its safe against
> > list-mutation primitives, so its fine.  Although you are correct, in that its
> > safety is dependent on the protection of rcu_read_lock(), so I'll add that in.
> > Thanks for the catch!  New patch attached
> > 
> > Change notes:
> > 1) Add rcu_read_lock/unlock protection around TRACE_OFF event
> > 
> > Neil
> ...
> >  static int set_all_monitor_traces(int state)
> >  {
> >  	int rc = 0;
> > +	struct dm_hw_stat_delta *new_stat = NULL;
> > +
> > +	spin_lock(&trace_state_lock);
> >  
> >  	switch (state) {
> >  	case TRACE_ON:
> >  		rc |= register_trace_kfree_skb(trace_kfree_skb_hit);
> > +		rc |= register_trace_napi_poll(trace_napi_poll_hit);
> >  		break;
> >  	case TRACE_OFF:
> >  		rc |= unregister_trace_kfree_skb(trace_kfree_skb_hit);
> > +		rc |= unregister_trace_napi_poll(trace_napi_poll_hit);
> >  
> >  		tracepoint_synchronize_unregister();
> > +
> > +		/*
> > +		 * Clean the device list
> > +		 */
> > +		rcu_read_lock();
> > +		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
> > +			if (new_stat->dev == NULL) {
> > +				list_del_rcu(&new_stat->list);
> > +				call_rcu(&new_stat->rcu, free_dm_hw_stat);
> > +			}
> > +		}
> > +		rcu_read_unlock();
> 
> IMHO it looks worse now. rcu_read_lock() suggests it's a read side,
> and spin_lock(&trace_state_lock) protects something else.
> 
the read lock is required (according to the comments for the list loop
primitive) to protect against the embedded mutation primitive, so its required.
I understand that its a bit counterintuitive, but intuition takes a backseat to
functionality. :)
Neil

> Jarek P.
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - May 15, 2009, 11:12 a.m.
On Fri, May 15, 2009 at 07:01:41AM -0400, Neil Horman wrote:
> On Fri, May 15, 2009 at 05:49:47AM +0000, Jarek Poplawski wrote:
...
> > IMHO it looks worse now. rcu_read_lock() suggests it's a read side,
> > and spin_lock(&trace_state_lock) protects something else.
> > 
> the read lock is required (according to the comments for the list loop
> primitive) to protect against the embedded mutation primitive, so its required.
> I understand that its a bit counterintuitive, but intuition takes a backseat to
> functionality. :)
> Neil
> 

I guess, you missed:

> Looks good from an RCU viewpoint!
> 
> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

for the previous version...

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman - May 15, 2009, 11:12 a.m.
On Fri, May 15, 2009 at 09:37:28AM +0200, Eric Dumazet wrote:
> Jiri Pirko a écrit :
> > Thu, May 14, 2009 at 07:29:54PM CEST, nhorman@tuxdriver.com wrote:
> >> On Thu, May 14, 2009 at 02:44:08PM +0200, Jiri Pirko wrote:
> >>>> +
> >>>> +		/*
> >>>> +		 * Clean the device list
> >>>> +		 */
> >>>> +		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
> >>> 		^^^^^^^^^^^^^^^^^^^^^^^
> >>> This is meaningless here. Use list_for_each_entry_rcu only under rcu_read_lock.
> >>> Also it would be good to use list_for_each_entry_safe here since you're
> >>> modifying the list.
> >>>
> >> The definition of list_for_each_entry_rcu specifically says its safe against
> >> list-mutation primitives, so its fine.  Although you are correct, in that its
> >> safety is dependent on the protection of rcu_read_lock(), so I'll add that in.
> > 
> > You are right that list_for_each_entry_rcu is safe against list-mutation
> > primitives. But there's no need for this on the update side when you hold a writer
> > spinlock. Here I think it's better (and also less confusing) to use ordinary
> > list_for_each_entry which in this case must be list_for_each_entry_safe.
> 
> Absolutely.
> 
> RCU is tricky, and must be well understood. Using the right verbs is really needed
> to help everybody read the code and be able to understand it quickly and maintain it
> if needed.
> 
> In this particular case, the list_for_each_entry_rcu() is a litle bit more
> expensive than regular list_for_each_entry(), as it includes additionnal barriers.
> 
> Just reading code like this :
> 
> +	spin_lock(&trace_state_lock);
> +		rcu_read_lock();
> +		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
> +			if (new_stat->dev == NULL) {
> +				list_del_rcu(&new_stat->list);
> +				call_rcu(&new_stat->rcu, free_dm_hw_stat);
> +			}
> +		}
> +		rcu_read_unlock();
> +	spin_unlock(&trace_state_lock);
> 
> We *know* something is wrong, as rcu_read_lock() is supposed to guard a readside,
> so having both rcu_read_lock() and list_del_rcu() is certainly wrong, we dont have 
> to actually understand what is really done by the algorithm.
> 
> So following is much better : (but maybe not correct... I let you 
> check if list_for_each_entry_safe() would not be better here, since
> you delete elements in a list while iterating in it)
> 
> Maybe you can add a break; after call_rcu() if you know only one element
> can match the "if (new_stat->dev == NULL) " condition, and use normal list_for_each_entry()
> 
> +	spin_lock(&trace_state_lock);
> +		list_for_each_entry(new_stat, &hw_stats_list, list) {
> +			if (new_stat->dev == NULL) {
> +				list_del_rcu(&new_stat->list);
> +				call_rcu(&new_stat->rcu, free_dm_hw_stat);
> +			}
> +		}
> +	spin_unlock(&trace_state_lock);
> 
> Even if the 'wrong' version was not buggy (no crashes or corruption), the 'right' one
> is a commonly used construct in kernel that most dev are able to read without asking 
> themselves "is is correct or not ? Dont we have something strange here ?"
> 
> Neil, you need to read more code playing with RCU to get familiar with it, we all did same
> errors in the past :)
> 

Thanks :)
So help me understand something here then.  The trace_state_lock has no
intention of protecting the actual read side of this list (the the napi trace
hook that this patch adds).  My understanding of list_for_each_entry_rcu and
list_del_rcu, was that even with a call list_del_rcu, the list remained
unchanged unti a quiescent point was passed by all cpus (which is detected via
the counter in rcu_read_[un]lock.  How is it insufficient to use
rcu_read_[un]lock here to implement that protection?  I agree it looks
counterintuitive, but it makes sense to me from a function standpoint.

Further to that point, given the example that you have above of what you think
_should_ work, is exactly what I have in the set_all_monitor_trace function
currently (saving for the conversion of list_for_each_entry_rcu to
list_for_each_entry_safe and the removal of the read_lock).  If those changes
improve performance, then I can get behind them, but I don't see what problem
they avoid if I don't use them.  Can you clarify that?

Regards
Neil

> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman - May 15, 2009, 11:15 a.m.
On Fri, May 15, 2009 at 11:12:14AM +0000, Jarek Poplawski wrote:
> On Fri, May 15, 2009 at 07:01:41AM -0400, Neil Horman wrote:
> > On Fri, May 15, 2009 at 05:49:47AM +0000, Jarek Poplawski wrote:
> ...
> > > IMHO it looks worse now. rcu_read_lock() suggests it's a read side,
> > > and spin_lock(&trace_state_lock) protects something else.
> > > 
> > the read lock is required (according to the comments for the list loop
> > primitive) to protect against the embedded mutation primitive, so its required.
> > I understand that its a bit counterintuitive, but intuition takes a backseat to
> > functionality. :)
> > Neil
> > 
> 
> I guess, you missed:
> 
> > Looks good from an RCU viewpoint!
> > 
> > Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> for the previous version...
> 
I didn't, our comments passed in flight.  Nevertheless, I'm not sure what this
adds (other than additional overhead), which I agree is bad and so might should
be removed, but there are some outstanding questions regarding if it is needed
in relation to the list primitives I'm using here.  According to Eric,
list_for_each_entry_safe might be less intrusive here, and I'm trying to figure
out if I agree. :)
Neil

> Jarek P.
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiri Pirko - May 15, 2009, 11:27 a.m.
Fri, May 15, 2009 at 12:59:09PM CEST, nhorman@tuxdriver.com wrote:
>On Fri, May 15, 2009 at 08:51:02AM +0200, Jiri Pirko wrote:
>> Thu, May 14, 2009 at 07:29:54PM CEST, nhorman@tuxdriver.com wrote:
>> >On Thu, May 14, 2009 at 02:44:08PM +0200, Jiri Pirko wrote:
>> >> >+
>> >> >+		/*
>> >> >+		 * Clean the device list
>> >> >+		 */
>> >> >+		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
>> >> 		^^^^^^^^^^^^^^^^^^^^^^^
>> >> This is meaningless here. Use list_for_each_entry_rcu only under rcu_read_lock.
>> >> Also it would be good to use list_for_each_entry_safe here since you're
>> >> modifying the list.
>> >> 
>> >
>> >The definition of list_for_each_entry_rcu specifically says its safe against
>> >list-mutation primitives, so its fine.  Although you are correct, in that its
>> >safety is dependent on the protection of rcu_read_lock(), so I'll add that in.
>> 
>> You are right that list_for_each_entry_rcu is safe against list-mutation
>> primitives. But there's no need for this on the update side when you hold a writer
>> spinlock. Here I think it's better (and also less confusing) to use ordinary
>> list_for_each_entry which in this case must be list_for_each_entry_safe.
>> 
>> 		Jirka
>> 
>I disagree.  I think visually its more understandable with the rcu variant, and
>that spinlock isn't a writer spinlock, the other uses of this list don't use
>that lock (to imporove performance).  The spinlock is there simply to serialize
>changes to the trace state (so that one cpu doen't try to turn dropmon on, while
>another tries to turn it off).  So we need the rcu variant to protect the list
>removal against concurrent use on the reader side in the new registered trace
>point function.

Ok, since you spinlock is not writer lock, you need to protect writers from
concurrent list updates anyway! And this spinlock does that work. Then you
_donotneed_ rcu_read_lock to read from list, you work with it as with ordinary
list while reading. As Eric wrote this approach is common.

		Jirka
>
>Neil
>
>> >Thanks for the catch!  New patch attached
>> >
>> >Change notes:
>> >1) Add rcu_read_lock/unlock protection around TRACE_OFF event
>> >
>> >Neil
>> >
>> >
>> >Patch to add the ability to detect drops in hardware interfaces via dropwatch.
>> >Adds a tracepoint to net_rx_action to signal everytime a napi instance is
>> >polled.  The dropmon code then periodically checks to see if the rx_frames
>> >counter has changed, and if so, adds a drop notification to the netlink
>> >protocol, using the reserved all-0's vector to indicate the drop location was in
>> >hardware, rather than somewhere in the code.
>> >
>> >Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>> >
>> >
>> > include/linux/net_dropmon.h |    7 ++
>> > include/trace/napi.h        |   11 ++++
>> > net/core/dev.c              |    5 +
>> > net/core/drop_monitor.c     |  117 ++++++++++++++++++++++++++++++++++++++++++--
>> > net/core/net-traces.c       |    4 +
>> > net/core/netpoll.c          |    2 
>> > 6 files changed, 141 insertions(+), 5 deletions(-)
>> >
>> >diff --git a/include/linux/net_dropmon.h b/include/linux/net_dropmon.h
>> >index 0217fb8..9addc62 100644
>> >--- a/include/linux/net_dropmon.h
>> >+++ b/include/linux/net_dropmon.h
>> >@@ -8,6 +8,13 @@ struct net_dm_drop_point {
>> > 	__u32 count;
>> > };
>> > 
>> >+#define is_drop_point_hw(x) do {\
>> >+	int ____i, ____j;\
>> >+	for (____i = 0; ____i < 8; i ____i++)\
>> >+		____j |= x[____i];\
>> >+	____j;\
>> >+} while (0)
>> >+
>> > #define NET_DM_CFG_VERSION  0
>> > #define NET_DM_CFG_ALERT_COUNT  1
>> > #define NET_DM_CFG_ALERT_DELAY 2
>> >diff --git a/include/trace/napi.h b/include/trace/napi.h
>> >new file mode 100644
>> >index 0000000..7c39eb7
>> >--- /dev/null
>> >+++ b/include/trace/napi.h
>> >@@ -0,0 +1,11 @@
>> >+#ifndef _TRACE_NAPI_H_
>> >+#define _TRACE_NAPI_H_
>> >+
>> >+#include <linux/skbuff.h>
>> >+#include <linux/tracepoint.h>
>> >+
>> >+DECLARE_TRACE(napi_poll,
>> >+	TP_PROTO(struct napi_struct *napi),
>> >+	TP_ARGS(napi));
>> >+
>> >+#endif
>> >diff --git a/net/core/dev.c b/net/core/dev.c
>> >index 637ea71..165979c 100644
>> >--- a/net/core/dev.c
>> >+++ b/net/core/dev.c
>> >@@ -126,6 +126,7 @@
>> > #include <linux/in.h>
>> > #include <linux/jhash.h>
>> > #include <linux/random.h>
>> >+#include <trace/napi.h>
>> > 
>> > #include "net-sysfs.h"
>> > 
>> >@@ -2763,8 +2764,10 @@ static void net_rx_action(struct softirq_action *h)
>> > 		 * accidently calling ->poll() when NAPI is not scheduled.
>> > 		 */
>> > 		work = 0;
>> >-		if (test_bit(NAPI_STATE_SCHED, &n->state))
>> >+		if (test_bit(NAPI_STATE_SCHED, &n->state)) {
>> > 			work = n->poll(n, weight);
>> >+			trace_napi_poll(n);
>> >+		}
>> > 
>> > 		WARN_ON_ONCE(work > weight);
>> > 
>> >diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
>> >index 2797b71..f9130d5 100644
>> >--- a/net/core/drop_monitor.c
>> >+++ b/net/core/drop_monitor.c
>> >@@ -22,8 +22,10 @@
>> > #include <linux/timer.h>
>> > #include <linux/bitops.h>
>> > #include <net/genetlink.h>
>> >+#include <net/netevent.h>
>> > 
>> > #include <trace/skb.h>
>> >+#include <trace/napi.h>
>> > 
>> > #include <asm/unaligned.h>
>> > 
>> >@@ -38,7 +40,8 @@ static void send_dm_alert(struct work_struct *unused);
>> >  * and the work handle that will send up
>> >  * netlink alerts
>> >  */
>> >-struct sock *dm_sock;
>> >+static int trace_state = TRACE_OFF;
>> >+static spinlock_t trace_state_lock = SPIN_LOCK_UNLOCKED;
>> > 
>> > struct per_cpu_dm_data {
>> > 	struct work_struct dm_alert_work;
>> >@@ -47,6 +50,13 @@ struct per_cpu_dm_data {
>> > 	struct timer_list send_timer;
>> > };
>> > 
>> >+struct dm_hw_stat_delta {
>> >+	struct net_device *dev;
>> >+	struct list_head list;
>> >+	struct rcu_head rcu;
>> >+	unsigned long last_drop_val;
>> >+};
>> >+
>> > static struct genl_family net_drop_monitor_family = {
>> > 	.id             = GENL_ID_GENERATE,
>> > 	.hdrsize        = 0,
>> >@@ -59,7 +69,8 @@ static DEFINE_PER_CPU(struct per_cpu_dm_data, dm_cpu_data);
>> > 
>> > static int dm_hit_limit = 64;
>> > static int dm_delay = 1;
>> >-
>> >+static unsigned long dm_hw_check_delta = 2*HZ;
>> >+static LIST_HEAD(hw_stats_list);
>> > 
>> > static void reset_per_cpu_data(struct per_cpu_dm_data *data)
>> > {
>> >@@ -115,7 +126,7 @@ static void sched_send_work(unsigned long unused)
>> > 	schedule_work(&data->dm_alert_work);
>> > }
>> > 
>> >-static void trace_kfree_skb_hit(struct sk_buff *skb, void *location)
>> >+static void trace_drop_common(struct sk_buff *skb, void *location)
>> > {
>> > 	struct net_dm_alert_msg *msg;
>> > 	struct nlmsghdr *nlh;
>> >@@ -159,26 +170,81 @@ out:
>> > 	return;
>> > }
>> > 
>> >+static void trace_kfree_skb_hit(struct sk_buff *skb, void *location)
>> >+{
>> >+	trace_drop_common(skb, location);
>> >+}
>> >+
>> >+static void trace_napi_poll_hit(struct napi_struct *napi)
>> >+{
>> >+	struct dm_hw_stat_delta *new_stat;
>> >+
>> >+	/*
>> >+	 * Ratelimit our check time to dm_hw_check_delta jiffies
>> >+	 */
>> >+	if (!time_after(jiffies, napi->dev->last_rx + dm_hw_check_delta))
>> >+		return;
>> >+
>> >+	rcu_read_lock();
>> >+	list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
>> >+		if ((new_stat->dev == napi->dev)  &&
>> >+		    (napi->dev->stats.rx_dropped != new_stat->last_drop_val)) {
>> >+			trace_drop_common(NULL, NULL);
>> >+			new_stat->last_drop_val = napi->dev->stats.rx_dropped;
>> >+			break;
>> >+		}
>> >+	}
>> >+	rcu_read_unlock();
>> >+}
>> >+
>> >+
>> >+static void free_dm_hw_stat(struct rcu_head *head)
>> >+{
>> >+	struct dm_hw_stat_delta *n;
>> >+	n = container_of(head, struct dm_hw_stat_delta, rcu);
>> >+	kfree(n);
>> >+}
>> >+
>> > static int set_all_monitor_traces(int state)
>> > {
>> > 	int rc = 0;
>> >+	struct dm_hw_stat_delta *new_stat = NULL;
>> >+
>> >+	spin_lock(&trace_state_lock);
>> > 
>> > 	switch (state) {
>> > 	case TRACE_ON:
>> > 		rc |= register_trace_kfree_skb(trace_kfree_skb_hit);
>> >+		rc |= register_trace_napi_poll(trace_napi_poll_hit);
>> > 		break;
>> > 	case TRACE_OFF:
>> > 		rc |= unregister_trace_kfree_skb(trace_kfree_skb_hit);
>> >+		rc |= unregister_trace_napi_poll(trace_napi_poll_hit);
>> > 
>> > 		tracepoint_synchronize_unregister();
>> >+
>> >+		/*
>> >+		 * Clean the device list
>> >+		 */
>> >+		rcu_read_lock();
>> >+		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
>> >+			if (new_stat->dev == NULL) {
>> >+				list_del_rcu(&new_stat->list);
>> >+				call_rcu(&new_stat->rcu, free_dm_hw_stat);
>> >+			}
>> >+		}
>> >+		rcu_read_unlock();
>> > 		break;
>> > 	default:
>> > 		rc = 1;
>> > 		break;
>> > 	}
>> > 
>> >+	spin_unlock(&trace_state_lock);
>> >+
>> > 	if (rc)
>> > 		return -EINPROGRESS;
>> >+	trace_state = state;
>> > 	return rc;
>> > }
>> > 
>> >@@ -204,6 +270,38 @@ static int net_dm_cmd_trace(struct sk_buff *skb,
>> > 	return -ENOTSUPP;
>> > }
>> > 
>> >+static int dropmon_net_event(struct notifier_block *ev_block,
>> >+			unsigned long event, void *ptr)
>> >+{
>> >+	struct net_device *dev = ptr;
>> >+	struct dm_hw_stat_delta *new_stat = NULL;
>> >+
>> >+	switch (event) {
>> >+	case NETDEV_REGISTER:
>> >+		new_stat = kzalloc(sizeof(struct dm_hw_stat_delta), GFP_KERNEL);
>> >+
>> >+		if (!new_stat)
>> >+			goto out;
>> >+
>> >+		new_stat->dev = dev;
>> >+		INIT_RCU_HEAD(&new_stat->rcu);
>> >+		spin_lock(&trace_state_lock);
>> >+		list_add_rcu(&new_stat->list, &hw_stats_list);
>> >+		spin_unlock(&trace_state_lock);
>> >+		break;
>> >+	case NETDEV_UNREGISTER:
>> >+		rcu_read_lock();
>> >+		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
>> >+			if (new_stat->dev == dev)
>> >+				new_stat->dev = NULL;
>> >+				break;
>> >+		}
>> >+		rcu_read_unlock();
>> >+		break;
>> >+	}
>> >+out:
>> >+	return NOTIFY_DONE;
>> >+}
>> > 
>> > static struct genl_ops dropmon_ops[] = {
>> > 	{
>> >@@ -220,6 +318,10 @@ static struct genl_ops dropmon_ops[] = {
>> > 	},
>> > };
>> > 
>> >+static struct notifier_block dropmon_net_notifier = {
>> >+	.notifier_call = dropmon_net_event
>> >+};
>> >+
>> > static int __init init_net_drop_monitor(void)
>> > {
>> > 	int cpu;
>> >@@ -243,12 +345,18 @@ static int __init init_net_drop_monitor(void)
>> > 		ret = genl_register_ops(&net_drop_monitor_family,
>> > 					&dropmon_ops[i]);
>> > 		if (ret) {
>> >-			printk(KERN_CRIT "failed to register operation %d\n",
>> >+			printk(KERN_CRIT "Failed to register operation %d\n",
>> > 				dropmon_ops[i].cmd);
>> > 			goto out_unreg;
>> > 		}
>> > 	}
>> > 
>> >+	rc = register_netdevice_notifier(&dropmon_net_notifier);
>> >+	if (rc < 0) {
>> >+		printk(KERN_CRIT "Failed to register netdevice notifier\n");
>> >+		goto out_unreg;
>> >+	}
>> >+
>> > 	rc = 0;
>> > 
>> > 	for_each_present_cpu(cpu) {
>> >@@ -259,6 +367,7 @@ static int __init init_net_drop_monitor(void)
>> > 		data->send_timer.data = cpu;
>> > 		data->send_timer.function = sched_send_work;
>> > 	}
>> >+
>> > 	goto out;
>> > 
>> > out_unreg:
>> >diff --git a/net/core/net-traces.c b/net/core/net-traces.c
>> >index c8fb456..b07b25b 100644
>> >--- a/net/core/net-traces.c
>> >+++ b/net/core/net-traces.c
>> >@@ -20,6 +20,7 @@
>> > #include <linux/netlink.h>
>> > #include <linux/net_dropmon.h>
>> > #include <trace/skb.h>
>> >+#include <trace/napi.h>
>> > 
>> > #include <asm/unaligned.h>
>> > #include <asm/bitops.h>
>> >@@ -27,3 +28,6 @@
>> > 
>> > DEFINE_TRACE(kfree_skb);
>> > EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);
>> >+
>> >+DEFINE_TRACE(napi_poll);
>> >+EXPORT_TRACEPOINT_SYMBOL_GPL(napi_poll);
>> >diff --git a/net/core/netpoll.c b/net/core/netpoll.c
>> >index b5873bd..446d3ba 100644
>> >--- a/net/core/netpoll.c
>> >+++ b/net/core/netpoll.c
>> >@@ -24,6 +24,7 @@
>> > #include <net/tcp.h>
>> > #include <net/udp.h>
>> > #include <asm/unaligned.h>
>> >+#include <trace/napi.h>
>> > 
>> > /*
>> >  * We maintain a small pool of fully-sized skbs, to make sure the
>> >@@ -137,6 +138,7 @@ static int poll_one_napi(struct netpoll_info *npinfo,
>> > 	set_bit(NAPI_STATE_NPSVC, &napi->state);
>> > 
>> > 	work = napi->poll(napi, budget);
>> >+	trace_napi_poll(napi->dev);
>> > 
>> > 	clear_bit(NAPI_STATE_NPSVC, &napi->state);
>> > 	atomic_dec(&trapped);
>> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - May 15, 2009, 11:40 a.m.
On Fri, May 15, 2009 at 07:15:30AM -0400, Neil Horman wrote:
> On Fri, May 15, 2009 at 11:12:14AM +0000, Jarek Poplawski wrote:
> > On Fri, May 15, 2009 at 07:01:41AM -0400, Neil Horman wrote:
> > > On Fri, May 15, 2009 at 05:49:47AM +0000, Jarek Poplawski wrote:
> > ...
> > > > IMHO it looks worse now. rcu_read_lock() suggests it's a read side,
> > > > and spin_lock(&trace_state_lock) protects something else.
> > > > 
> > > the read lock is required (according to the comments for the list loop
> > > primitive) to protect against the embedded mutation primitive, so its required.
> > > I understand that its a bit counterintuitive, but intuition takes a backseat to
> > > functionality. :)
> > > Neil
> > > 
> > 
> > I guess, you missed:
> > 
> > > Looks good from an RCU viewpoint!
> > > 
> > > Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > 
> > for the previous version...
> > 
> I didn't, our comments passed in flight.  Nevertheless, I'm not sure what this
> adds (other than additional overhead), which I agree is bad and so might should
> be removed, but there are some outstanding questions regarding if it is needed
> in relation to the list primitives I'm using here.  According to Eric,
> list_for_each_entry_safe might be less intrusive here, and I'm trying to figure
> out if I agree. :)
> Neil

Paul "acked" two variants, and Eric prefers one of them. Adding
rcu_read_lock() makes sense only "If this code was shared between the
read side and the update side". Anyway it would need additional
comment. Otherwise it's misleading (but not wrong). And, since Paul
reviewed this, it's definitely not needed here because Paul is simply
always right ;-)

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney - May 16, 2009, 12:07 a.m.
On Fri, May 15, 2009 at 11:40:29AM +0000, Jarek Poplawski wrote:
> On Fri, May 15, 2009 at 07:15:30AM -0400, Neil Horman wrote:
> > On Fri, May 15, 2009 at 11:12:14AM +0000, Jarek Poplawski wrote:
> > > On Fri, May 15, 2009 at 07:01:41AM -0400, Neil Horman wrote:
> > > > On Fri, May 15, 2009 at 05:49:47AM +0000, Jarek Poplawski wrote:
> > > ...
> > > > > IMHO it looks worse now. rcu_read_lock() suggests it's a read side,
> > > > > and spin_lock(&trace_state_lock) protects something else.
> > > > > 
> > > > the read lock is required (according to the comments for the list loop
> > > > primitive) to protect against the embedded mutation primitive, so its required.
> > > > I understand that its a bit counterintuitive, but intuition takes a backseat to
> > > > functionality. :)
> > > > Neil
> > > > 
> > > 
> > > I guess, you missed:
> > > 
> > > > Looks good from an RCU viewpoint!
> > > > 
> > > > Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > 
> > > for the previous version...
> > > 
> > I didn't, our comments passed in flight.  Nevertheless, I'm not sure what this
> > adds (other than additional overhead), which I agree is bad and so might should
> > be removed, but there are some outstanding questions regarding if it is needed
> > in relation to the list primitives I'm using here.  According to Eric,
> > list_for_each_entry_safe might be less intrusive here, and I'm trying to figure
> > out if I agree. :)
> > Neil
> 
> Paul "acked" two variants, and Eric prefers one of them. Adding
> rcu_read_lock() makes sense only "If this code was shared between the
> read side and the update side". Anyway it would need additional
> comment. Otherwise it's misleading (but not wrong). And, since Paul
> reviewed this, it's definitely not needed here because Paul is simply
> always right ;-)

Much as I appreciate the vote of confidence...  ;-)

I believe that both versions work correctly, and that the difference
is therefore a matter of style.  My mild preference would be to use
rcu_read_lock() only if there was some possibility that a reader (some
task not holding the update-side lock) would execute this code.

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/include/linux/net_dropmon.h b/include/linux/net_dropmon.h
index 0217fb8..9addc62 100644
--- a/include/linux/net_dropmon.h
+++ b/include/linux/net_dropmon.h
@@ -8,6 +8,13 @@  struct net_dm_drop_point {
 	__u32 count;
 };
 
+#define is_drop_point_hw(x) do {\
+	int ____i, ____j;\
+	for (____i = 0; ____i < 8; i ____i++)\
+		____j |= x[____i];\
+	____j;\
+} while (0)
+
 #define NET_DM_CFG_VERSION  0
 #define NET_DM_CFG_ALERT_COUNT  1
 #define NET_DM_CFG_ALERT_DELAY 2
diff --git a/include/trace/napi.h b/include/trace/napi.h
new file mode 100644
index 0000000..7c39eb7
--- /dev/null
+++ b/include/trace/napi.h
@@ -0,0 +1,11 @@ 
+#ifndef _TRACE_NAPI_H_
+#define _TRACE_NAPI_H_
+
+#include <linux/skbuff.h>
+#include <linux/tracepoint.h>
+
+DECLARE_TRACE(napi_poll,
+	TP_PROTO(struct napi_struct *napi),
+	TP_ARGS(napi));
+
+#endif
diff --git a/net/core/dev.c b/net/core/dev.c
index 637ea71..165979c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -126,6 +126,7 @@ 
 #include <linux/in.h>
 #include <linux/jhash.h>
 #include <linux/random.h>
+#include <trace/napi.h>
 
 #include "net-sysfs.h"
 
@@ -2763,8 +2764,10 @@  static void net_rx_action(struct softirq_action *h)
 		 * accidently calling ->poll() when NAPI is not scheduled.
 		 */
 		work = 0;
-		if (test_bit(NAPI_STATE_SCHED, &n->state))
+		if (test_bit(NAPI_STATE_SCHED, &n->state)) {
 			work = n->poll(n, weight);
+			trace_napi_poll(n);
+		}
 
 		WARN_ON_ONCE(work > weight);
 
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index 2797b71..f9130d5 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -22,8 +22,10 @@ 
 #include <linux/timer.h>
 #include <linux/bitops.h>
 #include <net/genetlink.h>
+#include <net/netevent.h>
 
 #include <trace/skb.h>
+#include <trace/napi.h>
 
 #include <asm/unaligned.h>
 
@@ -38,7 +40,8 @@  static void send_dm_alert(struct work_struct *unused);
  * and the work handle that will send up
  * netlink alerts
  */
-struct sock *dm_sock;
+static int trace_state = TRACE_OFF;
+static spinlock_t trace_state_lock = SPIN_LOCK_UNLOCKED;
 
 struct per_cpu_dm_data {
 	struct work_struct dm_alert_work;
@@ -47,6 +50,13 @@  struct per_cpu_dm_data {
 	struct timer_list send_timer;
 };
 
+struct dm_hw_stat_delta {
+	struct net_device *dev;
+	struct list_head list;
+	struct rcu_head rcu;
+	unsigned long last_drop_val;
+};
+
 static struct genl_family net_drop_monitor_family = {
 	.id             = GENL_ID_GENERATE,
 	.hdrsize        = 0,
@@ -59,7 +69,8 @@  static DEFINE_PER_CPU(struct per_cpu_dm_data, dm_cpu_data);
 
 static int dm_hit_limit = 64;
 static int dm_delay = 1;
-
+static unsigned long dm_hw_check_delta = 2*HZ;
+static LIST_HEAD(hw_stats_list);
 
 static void reset_per_cpu_data(struct per_cpu_dm_data *data)
 {
@@ -115,7 +126,7 @@  static void sched_send_work(unsigned long unused)
 	schedule_work(&data->dm_alert_work);
 }
 
-static void trace_kfree_skb_hit(struct sk_buff *skb, void *location)
+static void trace_drop_common(struct sk_buff *skb, void *location)
 {
 	struct net_dm_alert_msg *msg;
 	struct nlmsghdr *nlh;
@@ -159,26 +170,81 @@  out:
 	return;
 }
 
+static void trace_kfree_skb_hit(struct sk_buff *skb, void *location)
+{
+	trace_drop_common(skb, location);
+}
+
+static void trace_napi_poll_hit(struct napi_struct *napi)
+{
+	struct dm_hw_stat_delta *new_stat;
+
+	/*
+	 * Ratelimit our check time to dm_hw_check_delta jiffies
+	 */
+	if (!time_after(jiffies, napi->dev->last_rx + dm_hw_check_delta))
+		return;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
+		if ((new_stat->dev == napi->dev)  &&
+		    (napi->dev->stats.rx_dropped != new_stat->last_drop_val)) {
+			trace_drop_common(NULL, NULL);
+			new_stat->last_drop_val = napi->dev->stats.rx_dropped;
+			break;
+		}
+	}
+	rcu_read_unlock();
+}
+
+
+static void free_dm_hw_stat(struct rcu_head *head)
+{
+	struct dm_hw_stat_delta *n;
+	n = container_of(head, struct dm_hw_stat_delta, rcu);
+	kfree(n);
+}
+
 static int set_all_monitor_traces(int state)
 {
 	int rc = 0;
+	struct dm_hw_stat_delta *new_stat = NULL;
+
+	spin_lock(&trace_state_lock);
 
 	switch (state) {
 	case TRACE_ON:
 		rc |= register_trace_kfree_skb(trace_kfree_skb_hit);
+		rc |= register_trace_napi_poll(trace_napi_poll_hit);
 		break;
 	case TRACE_OFF:
 		rc |= unregister_trace_kfree_skb(trace_kfree_skb_hit);
+		rc |= unregister_trace_napi_poll(trace_napi_poll_hit);
 
 		tracepoint_synchronize_unregister();
+
+		/*
+		 * Clean the device list
+		 */
+		rcu_read_lock();
+		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
+			if (new_stat->dev == NULL) {
+				list_del_rcu(&new_stat->list);
+				call_rcu(&new_stat->rcu, free_dm_hw_stat);
+			}
+		}
+		rcu_read_unlock();
 		break;
 	default:
 		rc = 1;
 		break;
 	}
 
+	spin_unlock(&trace_state_lock);
+
 	if (rc)
 		return -EINPROGRESS;
+	trace_state = state;
 	return rc;
 }
 
@@ -204,6 +270,38 @@  static int net_dm_cmd_trace(struct sk_buff *skb,
 	return -ENOTSUPP;
 }
 
+static int dropmon_net_event(struct notifier_block *ev_block,
+			unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct dm_hw_stat_delta *new_stat = NULL;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		new_stat = kzalloc(sizeof(struct dm_hw_stat_delta), GFP_KERNEL);
+
+		if (!new_stat)
+			goto out;
+
+		new_stat->dev = dev;
+		INIT_RCU_HEAD(&new_stat->rcu);
+		spin_lock(&trace_state_lock);
+		list_add_rcu(&new_stat->list, &hw_stats_list);
+		spin_unlock(&trace_state_lock);
+		break;
+	case NETDEV_UNREGISTER:
+		rcu_read_lock();
+		list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
+			if (new_stat->dev == dev)
+				new_stat->dev = NULL;
+				break;
+		}
+		rcu_read_unlock();
+		break;
+	}
+out:
+	return NOTIFY_DONE;
+}
 
 static struct genl_ops dropmon_ops[] = {
 	{
@@ -220,6 +318,10 @@  static struct genl_ops dropmon_ops[] = {
 	},
 };
 
+static struct notifier_block dropmon_net_notifier = {
+	.notifier_call = dropmon_net_event
+};
+
 static int __init init_net_drop_monitor(void)
 {
 	int cpu;
@@ -243,12 +345,18 @@  static int __init init_net_drop_monitor(void)
 		ret = genl_register_ops(&net_drop_monitor_family,
 					&dropmon_ops[i]);
 		if (ret) {
-			printk(KERN_CRIT "failed to register operation %d\n",
+			printk(KERN_CRIT "Failed to register operation %d\n",
 				dropmon_ops[i].cmd);
 			goto out_unreg;
 		}
 	}
 
+	rc = register_netdevice_notifier(&dropmon_net_notifier);
+	if (rc < 0) {
+		printk(KERN_CRIT "Failed to register netdevice notifier\n");
+		goto out_unreg;
+	}
+
 	rc = 0;
 
 	for_each_present_cpu(cpu) {
@@ -259,6 +367,7 @@  static int __init init_net_drop_monitor(void)
 		data->send_timer.data = cpu;
 		data->send_timer.function = sched_send_work;
 	}
+
 	goto out;
 
 out_unreg:
diff --git a/net/core/net-traces.c b/net/core/net-traces.c
index c8fb456..b07b25b 100644
--- a/net/core/net-traces.c
+++ b/net/core/net-traces.c
@@ -20,6 +20,7 @@ 
 #include <linux/netlink.h>
 #include <linux/net_dropmon.h>
 #include <trace/skb.h>
+#include <trace/napi.h>
 
 #include <asm/unaligned.h>
 #include <asm/bitops.h>
@@ -27,3 +28,6 @@ 
 
 DEFINE_TRACE(kfree_skb);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);
+
+DEFINE_TRACE(napi_poll);
+EXPORT_TRACEPOINT_SYMBOL_GPL(napi_poll);
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index b5873bd..446d3ba 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -24,6 +24,7 @@ 
 #include <net/tcp.h>
 #include <net/udp.h>
 #include <asm/unaligned.h>
+#include <trace/napi.h>
 
 /*
  * We maintain a small pool of fully-sized skbs, to make sure the
@@ -137,6 +138,7 @@  static int poll_one_napi(struct netpoll_info *npinfo,
 	set_bit(NAPI_STATE_NPSVC, &napi->state);
 
 	work = napi->poll(napi, budget);
+	trace_napi_poll(napi->dev);
 
 	clear_bit(NAPI_STATE_NPSVC, &napi->state);
 	atomic_dec(&trapped);