
net: less interrupt masking in NAPI

Message ID 1414937973.31792.37.camel@edumazet-glaptop2.roam.corp.google.com
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Nov. 2, 2014, 2:19 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

net_rx_action() can mask irqs a single time, for a very short duration,
to transfer sd->poll_list onto a private list.

Then, napi_complete() can avoid masking irqs again,
and net_rx_action() only needs to mask irqs again in the slow path.

This patch removes two pairs of irq mask/unmask operations per typical
NAPI run, and more if multiple NAPI instances were triggered.

Note this also allows giving control back to the caller (do_softirq())
more often, so that other softirq handlers can be called a bit earlier,
or ksoftirqd can be woken up earlier under pressure.

This was developed while testing an alternative to RX interrupt
mitigation to reduce latencies while keeping or improving GRO
aggregation on fast NICs.

The idea is to test napi->gro_list at the end of napi->poll() and
reschedule one NAPI poll, but only after servicing a full round of
softirqs (timers, TX, RCU, ...). This would be allowed only if the
softirq is currently being serviced by the idle task or by ksoftirqd,
and a resched is not needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
---
 net/core/dev.c |   68 +++++++++++++++++++++++++++++------------------
 1 file changed, 43 insertions(+), 25 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Nov. 3, 2014, 5:25 p.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 02 Nov 2014 06:19:33 -0800

Also applied, thanks Eric.
Yang Yingliang Dec. 3, 2014, 7:31 a.m. UTC | #2
On 2014/11/4 1:25, David Miller wrote:
> Also applied, thanks Eric.

This patch resolves a performance problem for me.
Can this patch be queued for stable?

Regards,
Yang


Eric Dumazet Dec. 3, 2014, 7:41 a.m. UTC | #3
On Wed, 2014-12-03 at 15:31 +0800, Yang Yingliang wrote:

> This patch resolves a performance problem for me.
> Can this patch be queued for stable?

Hmm... please give us more details; I am surprised it can have
such a big impact.


Yang Yingliang Dec. 3, 2014, 9:26 a.m. UTC | #4
On 2014/12/3 15:41, Eric Dumazet wrote:
> Hmm... please give us more details; I am surprised it can have
> such a big impact.
> 
Before this patch, when a large network flow arrives, some other processes
respond slowly or do not respond at all because the CPU is busy handling
softirqs.

After this patch, under pressure, much more of the softirq work is done in
ksoftirqd, so the other processes can be scheduled.

My system has a dual-core CPU.
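
For what it's worth, this kind of shift of softirq work into ksoftirqd can
be observed from userspace; a generic Linux recipe (not part of the
original report):

```shell
# Per-CPU NET_RX softirq counts (header row plus the NET_RX row):
grep -E 'CPU0|NET_RX' /proc/softirqs

# CPU time consumed so far by a per-CPU ksoftirqd thread
# (may print nothing inside some containers, hence the || true):
ps -o pid,comm,time -C ksoftirqd/0 || true
```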

Regards,
Yang

Eric Dumazet Dec. 3, 2014, 11:52 a.m. UTC | #5
On Wed, 2014-12-03 at 17:26 +0800, Yang Yingliang wrote:

> Before this patch, when a large network flow arrives, some other processes
> respond slowly or do not respond at all because the CPU is busy handling
> softirqs.
> 
> After this patch, under pressure, much more of the softirq work is done in
> ksoftirqd, so the other processes can be scheduled.
> 
> My system has a dual-core CPU.

Which NIC driver are you using?

Thanks


Yang Yingliang Dec. 4, 2014, 2:10 a.m. UTC | #6
On 2014/12/3 19:52, Eric Dumazet wrote:
> Which NIC driver are you using?
> 
It is a driver we developed ourselves; it is not upstreamed.

Eric Dumazet Dec. 4, 2014, 2:51 a.m. UTC | #7
On Thu, 2014-12-04 at 10:10 +0800, Yang Yingliang wrote:
> On 2014/12/3 19:52, Eric Dumazet wrote:
> > Which NIC driver are you using?
> > 
> It is a driver we developed ourselves; it is not upstreamed.

Is it a NAPI driver?

Quite frankly this patch is not a stable candidate, as it is a
performance improvement, and it caused regressions in a couple of drivers.



Yang Yingliang Dec. 4, 2014, 3:09 a.m. UTC | #8
On 2014/12/4 10:51, Eric Dumazet wrote:
> On Thu, 2014-12-04 at 10:10 +0800, Yang Yingliang wrote:
>> On 2014/12/3 19:52, Eric Dumazet wrote:
>>> Which NIC driver are you using?
>>>
>> It is a driver we developed ourselves; it is not upstreamed.
> 
> Is it a NAPI driver?
Yes.
> 
> Quite frankly this patch is not a stable candidate, as it is a
> performance improvement, and it caused regressions in a couple of drivers.
> 
OK, thanks for your reply.


David Miller Dec. 4, 2014, 5:47 a.m. UTC | #9
From: Yang Yingliang <yangyingliang@huawei.com>
Date: Wed, 3 Dec 2014 15:31:50 +0800

> This patch resolves a performance problem for me.
> Can this patch be queued for stable?

Such an optimization is not appropriate for -stable.
Oded Gabbay Jan. 10, 2015, 8:27 p.m. UTC | #10
On 2014/11/4 1:25, David Miller wrote:
> Also applied, thanks Eric.

Hi,
Unfortunately, this patch breaks my "Qualcomm Atheros AR8161 Gigabit 
Ethernet (rev 10)" Ethernet controller, which is handled by the alx 
network driver.

ogabbay@odedg-ubuntu:~$ lspci -s 01:00.0 -k
01:00.0 Ethernet controller: 
	Qualcomm Atheros AR8161 Gigabit Ethernet (rev 10)
	Subsystem: Qualcomm Atheros Device 1071
	Kernel driver in use: alx

I have this controller on a mobile platform which I use to test amdkfd 
on a Kaveri AMD APU. Starting with 3.19-rc1, the network stopped working 
when trying to transfer files through scp or NFS.

I bisected the kernel (from 3.18.0 to 3.19-rc1) and reached this patch.

Here is the log of the bisect:

git bisect start
# bad: [97bf6af1f928216fd6c5a66e8a57bfa95a659672] Linux 3.19-rc1
git bisect bad 97bf6af1f928216fd6c5a66e8a57bfa95a659672

# good: [b2776bf7149bddd1f4161f14f79520f17fc1d71d] Linux 3.18
git bisect good b2776bf7149bddd1f4161f14f79520f17fc1d71d

# bad: [70e71ca0af244f48a5dcf56dc435243792e3a495] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad 70e71ca0af244f48a5dcf56dc435243792e3a495

# good: [e28870f9b3e92cd3570925089c6bb789c2603bc4] Merge tag 'backlight-for-linus-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight
git bisect good e28870f9b3e92cd3570925089c6bb789c2603bc4

# bad: [450fa21942fe2c37f0c9f52d1a33bbc081eee288] sh_eth: Remove redundant alignment adjustment
git bisect bad 450fa21942fe2c37f0c9f52d1a33bbc081eee288

# bad: [5c8d19da950861d0482abc0ac3481acca34b008f] e100e: use netdev_rss_key_fill() helper
git bisect bad 5c8d19da950861d0482abc0ac3481acca34b008f

# good: [bf515fb11ab539c76d04f0e3c5216ed41f41d81f] Merge tag 'mac80211-next-for-john-2014-11-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
git bisect good bf515fb11ab539c76d04f0e3c5216ed41f41d81f

# bad: [2c99cd914d4fed9160d98849c9dd38034616768e] Merge branch 'amd-xgbe-next'
git bisect bad 2c99cd914d4fed9160d98849c9dd38034616768e

# good: [3d762a0f0ab9cb4a6b5993db3ce56c92f9f90ab2] net: dsa: Add support for reading switch registers with ethtool
git bisect good 3d762a0f0ab9cb4a6b5993db3ce56c92f9f90ab2

# bad: [8ce0c8254f15229aa99fc6c04141f28c446e5f8c] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
git bisect bad 8ce0c8254f15229aa99fc6c04141f28c446e5f8c

# good: [f0c65567b3c1b23f79e8a49139580a3872a68d1f] Merge branch 'sunvnet-multi-tx-queue'
git bisect good f0c65567b3c1b23f79e8a49139580a3872a68d1f

# bad: [547f2735c20023d7b50a791b1b17cacb652e9237] Merge branch 'mlx4-next'
git bisect bad 547f2735c20023d7b50a791b1b17cacb652e9237

# good: [4cdb1e2e3d3495423db558d3bb7ed11d66aabce7] net: shrink struct softnet_data
git bisect good 4cdb1e2e3d3495423db558d3bb7ed11d66aabce7

# bad: [0a98455666ec87378148a1dde97f1ce5baf75a64] net/mlx4_core: Protect port type setting by mutex
git bisect bad 0a98455666ec87378148a1dde97f1ce5baf75a64

# bad: [6e8066999800d90d52af5c84ac49ebf683d14cdc] net/mlx4_core: Prevent VF from changing port configuration
git bisect bad 6e8066999800d90d52af5c84ac49ebf683d14cdc

# bad: [d75b1ade567ffab085e8adbbdacf0092d10cd09c] net: less interrupt masking in NAPI
git bisect bad d75b1ade567ffab085e8adbbdacf0092d10cd09c

# first bad commit: [d75b1ade567ffab085e8adbbdacf0092d10cd09c] net: less interrupt masking in NAPI
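
The same procedure can be reproduced end to end on a throwaway repository
with `git bisect run`; everything below (repository contents, commit
subjects) is illustrative, not the actual kernel bisection:

```shell
# Demo of automated bisection: six commits, where "commit 4"
# is pretended to introduce the regression.
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q .
git config user.email demo@example.com
git config user.name  demo

for i in 1 2 3 4 5 6; do
    echo "$i" > state
    git add state
    git commit -qm "commit $i"
done

# HEAD (commit 6) is bad, HEAD~5 (commit 1) is good.
git bisect start HEAD HEAD~5 > /dev/null

# The run script exits 0 for good trees, non-zero for bad ones:
# here the tree is "bad" once state >= 4.
git bisect run sh -c 'test "$(cat state)" -lt 4' > /dev/null

git log -1 --format='first bad: %s' refs/bisect/bad
# prints: first bad: commit 4
```

The predicate script plays the role of the manual build-and-test step in
the log above; once `git bisect run` converges, `refs/bisect/bad` names
the first bad commit.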

Could you please solve this issue as it renders my board quite useless.

If you need more info, please ask.

Thanks,
	Oded

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index ebf778df58cd..40be481268de 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4316,20 +4316,28 @@  static void net_rps_action_and_irq_enable(struct softnet_data *sd)
 		local_irq_enable();
 }
 
+static bool sd_has_rps_ipi_waiting(struct softnet_data *sd)
+{
+#ifdef CONFIG_RPS
+	return sd->rps_ipi_list != NULL;
+#else
+	return false;
+#endif
+}
+
 static int process_backlog(struct napi_struct *napi, int quota)
 {
 	int work = 0;
 	struct softnet_data *sd = container_of(napi, struct softnet_data, backlog);
 
-#ifdef CONFIG_RPS
 	/* Check if we have pending ipi, its better to send them now,
 	 * not waiting net_rx_action() end.
 	 */
-	if (sd->rps_ipi_list) {
+	if (sd_has_rps_ipi_waiting(sd)) {
 		local_irq_disable();
 		net_rps_action_and_irq_enable(sd);
 	}
-#endif
+
 	napi->weight = weight_p;
 	local_irq_disable();
 	while (1) {
@@ -4356,7 +4364,6 @@  static int process_backlog(struct napi_struct *napi, int quota)
 			 * We can use a plain write instead of clear_bit(),
 			 * and we dont need an smp_mb() memory barrier.
 			 */
-			list_del(&napi->poll_list);
 			napi->state = 0;
 			rps_unlock(sd);
 
@@ -4406,7 +4413,7 @@  void __napi_complete(struct napi_struct *n)
 	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
 	BUG_ON(n->gro_list);
 
-	list_del(&n->poll_list);
+	list_del_init(&n->poll_list);
 	smp_mb__before_atomic();
 	clear_bit(NAPI_STATE_SCHED, &n->state);
 }
@@ -4424,9 +4431,15 @@  void napi_complete(struct napi_struct *n)
 		return;
 
 	napi_gro_flush(n, false);
-	local_irq_save(flags);
-	__napi_complete(n);
-	local_irq_restore(flags);
+
+	if (likely(list_empty(&n->poll_list))) {
+		WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
+	} else {
+		/* If n->poll_list is not empty, we need to mask irqs */
+		local_irq_save(flags);
+		__napi_complete(n);
+		local_irq_restore(flags);
+	}
 }
 EXPORT_SYMBOL(napi_complete);
 
@@ -4520,29 +4533,28 @@  static void net_rx_action(struct softirq_action *h)
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
 	unsigned long time_limit = jiffies + 2;
 	int budget = netdev_budget;
+	LIST_HEAD(list);
+	LIST_HEAD(repoll);
 	void *have;
 
 	local_irq_disable();
+	list_splice_init(&sd->poll_list, &list);
+	local_irq_enable();
 
-	while (!list_empty(&sd->poll_list)) {
+	while (!list_empty(&list)) {
 		struct napi_struct *n;
 		int work, weight;
 
-		/* If softirq window is exhuasted then punt.
+		/* If softirq window is exhausted then punt.
 		 * Allow this to run for 2 jiffies since which will allow
 		 * an average latency of 1.5/HZ.
 		 */
 		if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
 			goto softnet_break;
 
-		local_irq_enable();
 
-		/* Even though interrupts have been re-enabled, this
-		 * access is safe because interrupts can only add new
-		 * entries to the tail of this list, and only ->poll()
-		 * calls can remove this head entry from the list.
-		 */
-		n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);
+		n = list_first_entry(&list, struct napi_struct, poll_list);
+		list_del_init(&n->poll_list);
 
 		have = netpoll_poll_lock(n);
 
@@ -4564,8 +4576,6 @@  static void net_rx_action(struct softirq_action *h)
 
 		budget -= work;
 
-		local_irq_disable();
-
 		/* Drivers must not modify the NAPI state if they
 		 * consume the entire weight.  In such cases this code
 		 * still "owns" the NAPI instance and therefore can
@@ -4573,32 +4583,40 @@  static void net_rx_action(struct softirq_action *h)
 		 */
 		if (unlikely(work == weight)) {
 			if (unlikely(napi_disable_pending(n))) {
-				local_irq_enable();
 				napi_complete(n);
-				local_irq_disable();
 			} else {
 				if (n->gro_list) {
 					/* flush too old packets
 					 * If HZ < 1000, flush all packets.
 					 */
-					local_irq_enable();
 					napi_gro_flush(n, HZ >= 1000);
-					local_irq_disable();
 				}
-				list_move_tail(&n->poll_list, &sd->poll_list);
+				list_add_tail(&n->poll_list, &repoll);
 			}
 		}
 
 		netpoll_poll_unlock(have);
 	}
+
+	if (!sd_has_rps_ipi_waiting(sd) &&
+	    list_empty(&list) &&
+	    list_empty(&repoll))
+		return;
 out:
+	local_irq_disable();
+
+	list_splice_tail_init(&sd->poll_list, &list);
+	list_splice_tail(&repoll, &list);
+	list_splice(&list, &sd->poll_list);
+	if (!list_empty(&sd->poll_list))
+		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+
 	net_rps_action_and_irq_enable(sd);
 
 	return;
 
 softnet_break:
 	sd->time_squeeze++;
-	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 	goto out;
 }