diff mbox

Improved network performance by balancing Rx against other work

Message ID 87tysfu05d.wl%peterc@chubb.wattle.id.au
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Peter Chubb March 17, 2010, 2:55 a.m. UTC
Hi,
	I don't know how many people caught my talk at LCA, but the
slides are now up at http://www.lca2010.org.nz/slides/50036.pdf

The general approach is to restrict the work done in the Rx-side
processing to just 32 or so packets at a time then call
sys_sched_yield() to allow other system processing to get a look in.
Currently, NAPI processing happens in soft IRQ context, and much of it
with interrupts off.  Moreover, when a particular poll routine is over
budget, the softirq is reraised --- so the softirq thread keeps
running until the scheduler (which under loaded conditions sees
softirq as a CPU-intensive thread) decides to preempt it.  By this
time, more packets are queued than the upper levels can handle in a
reasonable time, killing overall system performance.  In addition,
because so much runs with interrupts disabled, real-time performance
sucks.

The results in the talk were for UDP echo, which is a bit easier to
analyse than NFS.

By running the interrupt handler as a separate thread running
SCHED_NORMAL, and getting rid of all deferred work (implicitly
coalescing interrupts) I see around 15% improvement in NFSstone.
What's more, under overload conditions, the standard Linux kernel's
performance droops markedly with increasing load; with this patch, it
remains approximately constant.

Against a 3GHz Celeron, the standard kernel gives a peak of around
8000 NFS ops/sec; as the load increases this drops a little.  With this
patch I see 10 000 ops/sec, and less drooping.

NFS over UDP
requests/sec    Completed	Completed
Generated	Std Kernel	My patch

5000		5000		5000
10000		8169		10000
15000		7938		10649
20000		7579		11084
25000		7541		10133

Median Latencies at low load are about the same, at around 300usec
(max 7ms).  At 10000 applied ops per second, with my patch the median
per-request latency is around 53ms, with the standard kernel 200ms.

CPU usage is around the same in both cases.

I've appended an updated patch (the one I used to present the results
at LCA2010 was extremely hacky, this one is just a bit hacky).

I'm not expecting this to be applied; but please try to reproduce my
results.  To do this `properly' is going to mean fairly major surgery
to the NAPI layer, and I'm not yet sure of the right way to do that.
I'd also like a sanity check on my analysis.


Signed-off-by: Peter Chubb <peter.chubb@nicta.com.au>
---
 drivers/net/e1000/e1000.h      |    2 
 drivers/net/e1000/e1000_main.c |  115 ++++++++++++++++++++++++-----------------
 kernel/fork.c                  |    2 
 kernel/sched.c                 |    1 
 4 files changed, 73 insertions(+), 47 deletions(-)


--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au           ERTOS within National ICT Australia
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller March 22, 2010, 3:21 a.m. UTC | #1
From: Peter Chubb <peter.chubb@nicta.com.au>
Date: Wed, 17 Mar 2010 13:55:58 +1100

> The general approach is to restrict the work done in the Rx-side
> processing to just 32 or so packets at a time then call
> sys_sched_yield() to allow other system processing to get a look in.
> Currently, NAPI processing happens in soft IRQ context, and much of it
> with interrupts off.

This is a deficiency in the locking done by such drivers.  Many
drivers lock properly and do not disable hardware interrupts during
NAPI processing.  Not only is this more efficient, it also makes
the driver more profilable.  For example, on cpus with only timer
based profiling everything done in NAPI context can be seen.

> In addition, because so much runs with interrupts disabled,
> real-time performance sucks.

Please respin your analysis to be done with a sane driver, one that
does not disable hardware interrupts.  Doing so is a driver, rather
than a fundamental, issue of NAPI.

Avoiding hardware interrupt disabling also has another huge benefit,
it means that the transmit path of the driver need not disable
hardware interrupts either since that already runs in software
interrupt disabled context.

Your analysis was essentially done with broken driver(s), so I think
I can stop reading here.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Chubb March 22, 2010, 4:04 a.m. UTC | #2
>>>>> "David" == David Miller <davem@davemloft.net> writes:

David> From: Peter Chubb <peter.chubb@nicta.com.au> Date: Wed, 17 Mar
David> 2010 13:55:58 +1100

>> The general approach is to restrict the work done in the Rx-side
>> processing to just 32 or so packets at a time then call
>> sys_sched_yield() to allow other system processing to get a look
>> in.  Currently, NAPI processing happens in soft IRQ context, and
>> much of it with interrupts off.

David> This is a deficiency in the locking done by such drivers.  Many
David> drivers lock properly and do not disable hardware interrupts
David> during NAPI processing.  Not only is this more efficient, it
David> also makes the driver more profilable.  For example, on cpus
David> with only timer based profiling everything done in NAPI context
David> can be seen.

Actually, the e1000 does not appear to disable interrupts during NAPI
processing.  However, softIRQ processing is still not being preempted
by a real-time process that wakes up.

The performance issue is not this, however, but the receive livelock problem
caused by too many packets being queued to higher layers --- more than
can be handled before the next interrupt causes more packets to be
batched up.  When the NAPI budget is consumed, but there are more packets
to handle, the NAPI layer currently reraises the softIRQ --- which
fires before anything else gets a go on the processor.


Anyway, if you think the e1000 driver is broken, I'll try with a
different one.  I had a look at the Broadcom tg3, but it looks too
hard to modify; so I'm now looking at the Realtek r8169 driver.  It
might take a few days.

PeterC
--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au           ERTOS within National ICT Australia
--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au           ERTOS within National ICT Australia
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 22, 2010, 4:11 a.m. UTC | #3
From: Peter Chubb <peter.chubb@nicta.com.au>
Date: Mon, 22 Mar 2010 15:04:24 +1100

>>>>>> "David" == David Miller <davem@davemloft.net> writes:
> 
> David> From: Peter Chubb <peter.chubb@nicta.com.au> Date: Wed, 17 Mar
> David> 2010 13:55:58 +1100
> 
>>> The general approach is to restrict the work done in the Rx-side
>>> processing to just 32 or so packets at a time then call
>>> sys_sched_yield() to allow other system processing to get a look
>>> in.  Currently, NAPI processing happens in soft IRQ context, and
>>> much of it with interrupts off.
> 
> David> This is a deficiency in the locking done by such drivers.  Many
> David> drivers lock properly and do not disable hardware interrupts
> David> during NAPI processing.  Not only is this more efficient, it
> David> also makes the driver more profilable.  For example, on cpus
> David> with only timer based profiling everything done in NAPI context
> David> can be seen.
> 
> Actually, the e1000 does not appear to disable interrupts during NAPI
> processing.

So you don't know the fundamental aspects of what you were actually
analyzing? :-)

> However, softIRQ processing is still not being preempted by a
> real-time process that wakes up.

I thought softirqs ran as threads in the -rt kernel, why doesn't
preemption work properly for those threads?

> I had a look at the Broadcom tg3, but it looks too
> hard to modify;

It's one of the best drivers in the locking area.

It doesn't take any locks at all in it's hardware interrupt handler.

It doesn't take any locks at all for RX packet processing.

And it only takes a lock for TX processing very briefly in one
specific case, when we need to wake up the TX queue because it became
full and was stopped and we need to wake it up in tg3_tx()
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Chubb March 22, 2010, 4:19 a.m. UTC | #4
>>>>> "David" == David Miller <davem@davemloft.net> writes:

David> From: Peter Chubb <peter.chubb@nicta.com.au> Date: Mon, 22 Mar
David> 2010 15:04:24 +1100

>>>>>>> "David" == David Miller <davem@davemloft.net> writes:
>>
David> From: Peter Chubb <peter.chubb@nicta.com.au> Date: Wed, 17 Mar

>> However, softIRQ processing is still not being preempted by a
>> real-time process that wakes up.

David> I thought softirqs ran as threads in the -rt kernel, why
David> doesn't preemption work properly for those threads?

I'm using the standard kernel here -- softIRQs run as threads
scheduled as SCHED_NORMAL, but because they're in the kernel, they're
not preemptible without an explicit preemption point, as far as I can
tell.  So hardware interrupts will run, and the budget/work done
mechanism will ensure that other softIRQs and in-kernel real-time
threads run, but userspace doesn't seem to get a look in.

>> I had a look at the Broadcom tg3, but it looks too hard to modify;

David> It's one of the best drivers in the locking area.

It looked good from that aspect, but used too many features of NAPI
for me to modify quickly, and still have a simple and
easy-to-understand patch.

Anyway, I'm intending to try to reproduce the results with a different
driver, as I said.

Peter C

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au           ERTOS within National ICT Australia
--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au           ERTOS within National ICT Australia
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 22, 2010, 4:24 a.m. UTC | #5
From: Peter Chubb <peter.chubb@nicta.com.au>
Date: Mon, 22 Mar 2010 15:19:54 +1100

> David> It's one of the best drivers in the locking area.
> 
> It looked good from that aspect, but used too many features of NAPI
> for me to modify quickly, and still have a simple and
> easy-to-understand patch.

What's so complicated about it?  All of the NAPI logic is
locks into a wrapper function that encapsulates all of
the top-level budget and interrupt masking logic.

> Anyway, I'm intending to try to reproduce the results with a different
> driver, as I said.

Well, you do so at your own peril.  tg3 is the best, whereas
r8169 is really awful and people report NAPI wedges with it
all the time.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nick Piggin March 22, 2010, 1:27 p.m. UTC | #6
On Wed, Mar 17, 2010 at 01:55:58PM +1100, Peter Chubb wrote:
> +static int e1000_intr_thread(void *data)
> +{
> +	struct net_device *netdev = data;
> +	struct e1000_adapter *adapter = netdev_priv(netdev);
> +	const int budget = 32; // FIXME should be auto-tuneable
> +	int tx_clean_complete = 0, work_done = 0;
> +
> +	while(!e1000_wait_for_intr(adapter)) {
> +		do {
> +			work_done = 0;
> +
> +			tx_clean_complete = e1000_clean_tx_irq(adapter, &adapter->tx_ring[0]);
> +			adapter->clean_rx(adapter, &adapter->rx_ring[0], &work_done, budget);
> +			if (!tx_clean_complete)
> +				work_done = budget;
> +
> +			if (work_done == budget) {
> +				/*
> +				 * Give up the rest of the timeslice to allow
> +				 * userspace to make forward progress
> +				 */
> +				sys_sched_yield();

sched_yield literally has undefined semantics if you are in SCHED_OTHER
priority (which your thread is). sched_yield use in userspace and even
kernel has caused headaches whenever the scheduler changes
significantly.

Please use anything but sched_yield. cond_resched() looks appropriate
here, it gives a simple resched point on !PREEMPT kernels. Then the
scheduler nice levels and newer resource controls should give sufficient
flexibility to fine tune CPU allocation after that.


> +			}
> +		} while (work_done == budget);
> +
> +		/* If budget not fully consumed, wait for an interrupt */
> +		adapter->last_icr = 0;
> +		if (likely(adapter->itr_setting & 3))
> +			e1000_set_itr(adapter);
> +		if (!test_bit(__E1000_DOWN, &adapter->flags))
> +			e1000_irq_enable(adapter);
> +	}
> +	return 0;
> +}

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-2.6-peterc/drivers/net/e1000/e1000.h
===================================================================
--- linux-2.6-peterc.orig/drivers/net/e1000/e1000.h
+++ linux-2.6-peterc/drivers/net/e1000/e1000.h
@@ -287,6 +287,8 @@  struct e1000_adapter {
 			     int cleaned_count);
 	struct e1000_rx_ring *rx_ring;      /* One per active queue */
 	struct napi_struct napi;
+	struct task_struct *irq_thread;
+	u32	last_icr;
 
 	int num_tx_queues;
 	int num_rx_queues;
Index: linux-2.6-peterc/drivers/net/e1000/e1000_main.c
===================================================================
--- linux-2.6-peterc.orig/drivers/net/e1000/e1000_main.c
+++ linux-2.6-peterc/drivers/net/e1000/e1000_main.c
@@ -28,6 +28,8 @@ 
 
 #include "e1000.h"
 #include <net/ip6_checksum.h>
+#include <linux/kthread.h>
+#include <linux/syscalls.h> /* sys_sched_yield() */
 
 char e1000_driver_name[] = "e1000";
 static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
@@ -131,6 +133,7 @@  static struct net_device_stats * e1000_g
 static int e1000_change_mtu(struct net_device *netdev, int new_mtu);
 static int e1000_set_mac(struct net_device *netdev, void *p);
 static irqreturn_t e1000_intr(int irq, void *data);
+static int e1000_intr_thread(void *data);
 static bool e1000_clean_tx_irq(struct e1000_adapter *adapter,
 			       struct e1000_tx_ring *tx_ring);
 static int e1000_clean(struct napi_struct *napi, int budget);
@@ -269,6 +272,12 @@  static int e1000_request_irq(struct e100
 		        "Unable to allocate interrupt Error: %d\n", err);
 	}
 
+	adapter->irq_thread = kthread_create(e1000_intr_thread,
+					     netdev,
+					     "irq/%d-%s",
+					     adapter->pdev->irq,
+					     netdev->name);
+	get_task_struct(adapter->irq_thread);
 	return err;
 }
 
@@ -277,6 +286,9 @@  static void e1000_free_irq(struct e1000_
 	struct net_device *netdev = adapter->netdev;
 
 	free_irq(adapter->pdev->irq, netdev);
+	kthread_stop(adapter->irq_thread);
+	put_task_struct(adapter->irq_thread);
+
 }
 
 /**
@@ -396,8 +408,6 @@  int e1000_up(struct e1000_adapter *adapt
 
 	clear_bit(__E1000_DOWN, &adapter->flags);
 
-	napi_enable(&adapter->napi);
-
 	e1000_irq_enable(adapter);
 
 	netif_wake_queue(adapter->netdev);
@@ -495,8 +505,6 @@  void e1000_down(struct e1000_adapter *ad
 	E1000_WRITE_FLUSH();
 	msleep(10);
 
-	napi_disable(&adapter->napi);
-
 	e1000_irq_disable(adapter);
 
 	del_timer_sync(&adapter->tx_fifo_stall_timer);
@@ -888,7 +896,6 @@  static int __devinit e1000_probe(struct 
 	netdev->netdev_ops = &e1000_netdev_ops;
 	e1000_set_ethtool_ops(netdev);
 	netdev->watchdog_timeo = 5 * HZ;
-	netif_napi_add(netdev, &adapter->napi, e1000_clean, 64);
 
 	strncpy(netdev->name, pci_name(pdev), sizeof(netdev->name) - 1);
 
@@ -1288,8 +1295,6 @@  static int e1000_open(struct net_device 
 	/* From here on the code is the same as e1000_up() */
 	clear_bit(__E1000_DOWN, &adapter->flags);
 
-	napi_enable(&adapter->napi);
-
 	e1000_irq_enable(adapter);
 
 	netif_start_queue(netdev);
@@ -3347,6 +3352,55 @@  void e1000_update_stats(struct e1000_ada
 	spin_unlock_irqrestore(&adapter->stats_lock, flags);
 }
 
+static int
+e1000_wait_for_intr(struct e1000_adapter *adapter)
+{
+	while (!kthread_should_stop()) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (adapter->last_icr) {
+			__set_current_state(TASK_RUNNING);
+			return 0;
+		}
+		schedule();
+	}
+	return -1;
+}
+
+static int e1000_intr_thread(void *data)
+{
+	struct net_device *netdev = data;
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	const int budget = 32; // FIXME should be auto-tuneable
+	int tx_clean_complete = 0, work_done = 0;
+
+	while(!e1000_wait_for_intr(adapter)) {
+		do {
+			work_done = 0;
+
+			tx_clean_complete = e1000_clean_tx_irq(adapter, &adapter->tx_ring[0]);
+			adapter->clean_rx(adapter, &adapter->rx_ring[0], &work_done, budget);
+			if (!tx_clean_complete)
+				work_done = budget;
+
+			if (work_done == budget) {
+				/*
+				 * Give up the rest of the timeslice to allow
+				 * userspace to make forward progress
+				 */
+				sys_sched_yield();
+			}
+		} while (work_done == budget);
+
+		/* If budget not fully consumed, wait for an interrupt */
+		adapter->last_icr = 0;
+		if (likely(adapter->itr_setting & 3))
+			e1000_set_itr(adapter);
+		if (!test_bit(__E1000_DOWN, &adapter->flags))
+			e1000_irq_enable(adapter);
+	}
+	return 0;
+}
+
 /**
  * e1000_intr - Interrupt Handler
  * @irq: interrupt number
@@ -3374,49 +3428,12 @@  static irqreturn_t e1000_intr(int irq, v
 	ew32(IMC, ~0);
 	E1000_WRITE_FLUSH();
 
-	if (likely(napi_schedule_prep(&adapter->napi))) {
-		adapter->total_tx_bytes = 0;
-		adapter->total_tx_packets = 0;
-		adapter->total_rx_bytes = 0;
-		adapter->total_rx_packets = 0;
-		__napi_schedule(&adapter->napi);
-	} else {
-		/* this really should not happen! if it does it is basically a
-		 * bug, but not a hard error, so enable ints and continue */
-		if (!test_bit(__E1000_DOWN, &adapter->flags))
-			e1000_irq_enable(adapter);
-	}
+	adapter->last_icr = icr;
+	wake_up_process(adapter->irq_thread);
 
 	return IRQ_HANDLED;
 }
 
-/**
- * e1000_clean - NAPI Rx polling callback
- * @adapter: board private structure
- **/
-static int e1000_clean(struct napi_struct *napi, int budget)
-{
-	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
-	int tx_clean_complete = 0, work_done = 0;
-
-	tx_clean_complete = e1000_clean_tx_irq(adapter, &adapter->tx_ring[0]);
-
-	adapter->clean_rx(adapter, &adapter->rx_ring[0], &work_done, budget);
-
-	if (!tx_clean_complete)
-		work_done = budget;
-
-	/* If budget not fully consumed, exit the polling mode */
-	if (work_done < budget) {
-		if (likely(adapter->itr_setting & 3))
-			e1000_set_itr(adapter);
-		napi_complete(napi);
-		if (!test_bit(__E1000_DOWN, &adapter->flags))
-			e1000_irq_enable(adapter);
-	}
-
-	return work_done;
-}
 
 /**
  * e1000_clean_tx_irq - Reclaim resources after transmit completes
Index: linux-2.6-peterc/kernel/fork.c
===================================================================
--- linux-2.6-peterc.orig/kernel/fork.c
+++ linux-2.6-peterc/kernel/fork.c
@@ -178,6 +178,8 @@  void __put_task_struct(struct task_struc
 		free_task(tsk);
 }
 
+EXPORT_SYMBOL(__put_task_struct);
+
 /*
  * macro override instead of weak attribute alias, to workaround
  * gcc 4.1.0 and 4.1.1 bugs with weak attribute and empty functions.
Index: linux-2.6-peterc/kernel/sched.c
===================================================================
--- linux-2.6-peterc.orig/kernel/sched.c
+++ linux-2.6-peterc/kernel/sched.c
@@ -4946,6 +4946,7 @@  SYSCALL_DEFINE0(sched_yield)
 
 	return 0;
 }
+EXPORT_SYMBOL(sys_sched_yield);
 
 static inline int should_resched(void)
 {