Patchwork [net-next-2.6,be2net] remove napi in the tx path and do tx completion processing in interrupt context

Submitter Ajit Khaparde
Date May 19, 2009, 12:10 p.m.
Message ID <20090519121056.GB26870@serverengines.com>
Permalink /patch/27394/
State Rejected
Delegated to: David Miller

Comments

Ajit Khaparde - May 19, 2009, 12:10 p.m.
Hi,
This patch removes NAPI in the Tx path and does Tx completion processing
in interrupt context.  This makes Tx completion processing simpler
without loss of performance.

Thanks
Ajit

Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
---
 drivers/net/benet/be.h      |    3 +-
 drivers/net/benet/be_main.c |  114 ++++++++++++++++++++++---------------------
 2 files changed, 60 insertions(+), 57 deletions(-)
David Miller - May 19, 2009, 10:13 p.m.
From: Ajit Khaparde <ajitk@serverengines.com>
Date: Tue, 19 May 2009 17:40:58 +0530

> This patch removes NAPI in the Tx path and does Tx completion
> processing in interrupt context.  This makes Tx completion
> processing simpler without loss of performance.
> 
> Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>

This is different from how every other NAPI driver does this.

You should have a single NAPI context that handles both TX and RX
processing.  Except that for TX processing, no work budget
adjustments are made.  You simply unconditionally process all pending
TX work without accounting it against the POLL call budget.

I have no idea why this driver tried to split the RX and TX
work like this; it accomplishes nothing but adding overhead.
Simply add the TX completion code to the RX poll handler
and that's all you need to do.  Also, make sure to run TX
polling work before RX polling work, this makes fresh SKBs
available for responses generated by RX packet processing.

I bet this is why you really saw performance problems, rather than
something to do with running it directly in interrupt context.  There
should be zero gain from that if you do the TX poll work properly in
the RX poll handler.  When you free TX packets in hardware interrupt
context using dev_kfree_skb_any(), that just schedules a software
interrupt to do the actual SKB free, which only adds more overhead
for TX processing work.  You aren't avoiding software IRQ work by
doing TX processing in the hardware interrupt handler; in fact, you
are theoretically doing more.

So the only conclusion I can come to is that what is important is
doing the TX completion work before the RX packets get processed in
the NAPI poll handler, and you accomplish that more efficiently and
more properly by simply moving the TX completion work to the top of
the RX poll handler code.

You also failed to provide any indication of what kind of
performance change occurs due to this patch; you just say
that it does occur.  This gives people very little confidence
in your changes, nor does it give them an idea of the magnitude
of the performance change this patch causes.

The kind of analysis I did above is the kind of thing I expect
to find in a commit message that makes delicate changes like this
for performance reasons, but I see none of that.

I'm not applying this patch, and your commit messages are still way
too terse.  You don't go into the testing you did, how you discovered
the problem, what made you decide to solve the problem this way, why
that approach works, nor the precise performance improvements you
saw.  In short, you give zero information as to the merits of this
change, and therefore nobody can review it properly.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - May 20, 2009, 9:25 a.m.
David Miller a écrit :
> From: Ajit Khaparde <ajitk@serverengines.com>
> Date: Tue, 19 May 2009 17:40:58 +0530
> 
>> This patch removes NAPI in the Tx path and does Tx completion
>> processing in interrupt context.  This makes Tx completion
>> processing simpler without loss of performance.
>>
>> Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
> 
> This is different from how every other NAPI driver does this.
> 
> You should have a single NAPI context that handles both TX and RX
> processing.  Except that for TX processing, no work budget
> adjustments are made.  You simply unconditionally process all pending
> TX work without accounting it against the POLL call budget.
> 
> I have no idea why this driver tried to split the RX and TX
> work like this; it accomplishes nothing but adding overhead.
> Simply add the TX completion code to the RX poll handler
> and that's all you need to do.  Also, make sure to run TX
> polling work before RX polling work, this makes fresh SKBs
> available for responses generated by RX packet processing.
> 
> I bet this is why you really saw performance problems, rather than
> something to do with running it directly in interrupt context.  There
> should be zero gain from that if you do the TX poll work properly in
> the RX poll handler.  When you free TX packets in hardware interrupt
> context using dev_kfree_skb_any() that just schedules a software
> interrupt to do the actual SKB free, which adds just more overhead for
> TX processing work.  You aren't avoiding software IRQ work by doing TX
> processing in the hardware interrupt handler, in fact you
> theoretically are doing more.
> 
> So the only conclusion I can come to is that what is important is
> doing the TX completion work before the RX packets get processed in
> the NAPI poll handler, and you accomplish that more efficiently and
> more properly by simply moving the TX completion work to the top of
> the RX poll handler code.
> 

Thanks, David, for this analysis.

I would like to point out a scalability problem we currently have with
non-multiqueue devices on multi-core hosts under the scheme you
described/advocated.

(this has nothing to do with the be2net patch; please forgive me for jumping in)

When a lot of network traffic is handled by one device, we enter a
ksoftirqd/napi mode, where one cpu is almost dedicated to handling
both TX completions and RX completions, while other cpus
run application code (and some parts of the TCP/UDP stack).

That's really expensive because of the many cache-line ping-pongs that occur.

In that case, it would make sense to transfer most of the TX completion
work to the other cpus (the cpus that actually order the xmits): skb
freeing of course, and the sock_wfree() callbacks...

So maybe some NIC device drivers could let their ndo_start_xmit()
do some cleanup work on previously sent skbs.  If done correctly,
we could lower the number of cache-line ping-pongs.

This would give a breather to the cpu that would then only take care of
RX completions, and probably give better throughput.  Some machines out
there want to transmit a lot of frames while receiving few...


There is also a minor latency problem with the current scheme:
taking care of TX completion takes some time and delays RX handling,
increasing the latency of incoming traffic.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ajit Khaparde - May 20, 2009, 12:56 p.m.
Comments inline.
 
Thanks
Ajit

On 19/05/09 15:13 -0700, David Miller wrote:
> From: Ajit Khaparde <ajitk@serverengines.com>
> Date: Tue, 19 May 2009 17:40:58 +0530
> 
> > This patch removes NAPI in the Tx path and does Tx completion
> > processing in interrupt context.  This makes Tx completion
> > processing simpler without loss of performance.
> > 
> > Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
> 
> This is different from how every other NAPI driver does this.
> 
> You should have a single NAPI context that handles both TX and RX
> processing.  Except that for TX processing, no work budget
> adjustments are made.  You simply unconditionally process all pending
> TX work without accounting it against the POLL call budget.
> 
> I have no idea why this driver tried to split the RX and TX
> work like this; it accomplishes nothing but adding overhead.
> Simply add the TX completion code to the RX poll handler
> and that's all you need to do.  Also, make sure to run TX
> polling work before RX polling work, this makes fresh SKBs
> available for responses generated by RX packet processing.
Splitting RX and TX interrupts helps performance in certain workloads.
TX completion processing load is not significant with TCP traffic due to
TSO.  RX and TX completion processing on the same CPU limits throughput
in workloads that have significant UDP RX and TX.

> 
> I bet this is why you really saw performance problems, rather than
> something to do with running it directly in interrupt context.  There
> should be zero gain from that if you do the TX poll work properly in
> the RX poll handler.  When you free TX packets in hardware interrupt
> context using dev_kfree_skb_any() that just schedules a software
> interrupt to do the actual SKB free, which adds just more overhead for
> TX processing work.  You aren't avoiding software IRQ work by doing TX
> processing in the hardware interrupt handler, in fact you
> theoretically are doing more.
dev_kfree_skb_any() scheduling a software interrupt was something that we
had overlooked.

> 
> So the only conclusion I can come to is that what is important is
> doing the TX completion work before the RX packets get processed in
> the NAPI poll handler, and you accomplish that more efficiently and
> more properly by simply moving the TX completion work to the top of
> the RX poll handler code.
> 
> You also failed to provide any indication of what kind of
> performance change occurs due to this patch, you just say
> that it does occur.  This gives people very little confidence
> in your changes, nor does it give them an idea of the magnitude
> of the performance changes caused by this change.
> 
> The kind of analysis I did above is the kind of things I expect
> to find in a commit message that makes delicate changes like this
> for reasons of performance, but I see none of that.
> 
> I'm not applying this patch, and your commit messages are still way
> too terse.  You don't go into the testing you did, how you discovered
> the problem, what made you decide to solve the problem this way, why
> that approach works, nor the precise performance improvements you
> saw.  In short, you give zero information as to the merits of this
> change, and therefore nobody can review it properly.
We did not see any change in throughput or CPU utilization by doing TX
completion processing in interrupt context instead of the NAPI context.
The objective of the patch was to simplify the TX completion processing.

We will come back with a proper patch if necessary, after some more analysis.

David Miller - May 20, 2009, 7:46 p.m.
From: Ajit Khaparde <ajitk@serverengines.com>
Date: Wed, 20 May 2009 18:26:11 +0530

> We will come back with a proper patch if necessary, after some more analysis.

Thank you.
David Miller - May 21, 2009, 12:25 a.m.
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 20 May 2009 11:25:44 +0200

> When a lot of network traffic is handled by one device, we enter a
> ksoftirqd/napi mode, where one cpu is almost dedicated to handling
> both TX completions and RX completions, while other cpus
> run application code (and some parts of the TCP/UDP stack).
> 
> That's really expensive because of the many cache-line ping-pongs that occur.
>
> In that case, it would make sense to transfer most of the TX
> completion work to the other cpus (the cpus that actually order the
> xmits): skb freeing of course, and the sock_wfree() callbacks...

Yes and that kind of idea can be combined with the SW multiqueue
efforts such as those patches posted by google the other week.

> So maybe some NIC device drivers could let their ndo_start_xmit()
> do some cleanup work of previously sent skbs. If correctly done,
> we could lower number of cache line ping pongs.

That's another idea.  However, the ordering necessary to do this
correctly on some chips might make its cost prohibitive.  For
example, it might only be safe to check the consumer pointer value
DMA'd by a device into the status block after an IRQ is received,
unless some expensive synchronization (f.e. a register read) is
performed first.

> There is also a minor latency problem with the current scheme:
> taking care of TX completion takes some time and delays RX handling,
> increasing the latency of incoming traffic.

One thing that one must understand is that deferring any SKB freeing
increases the size of the working set of memory that the CPU has
to access.  Buffer reuse is absolutely essential to keep the working
set of unfree'd data under control.

This working set bloating effect is also, unfortunately, a hallmark of
RCU.  Especially before we had softint based RCU available.

Patch

diff --git a/drivers/net/benet/be.h b/drivers/net/benet/be.h
index b4bb06f..6d149a4 100644
--- a/drivers/net/benet/be.h
+++ b/drivers/net/benet/be.h
@@ -159,8 +159,6 @@  struct be_eq_obj {
 	u16 min_eqd;		/* in usecs */
 	u16 max_eqd;		/* in usecs */
 	u16 cur_eqd;		/* in usecs */
-
-	struct napi_struct napi;
 };
 
 struct be_tx_obj {
@@ -179,6 +177,7 @@  struct be_rx_page_info {
 };
 
 struct be_rx_obj {
+	struct napi_struct napi;
 	struct be_queue_info q;
 	struct be_queue_info cq;
 	struct be_rx_page_info page_info_tbl[RX_Q_LEN];
diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index ae2f6b5..640337c 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -991,7 +991,7 @@  static void be_tx_compl_process(struct be_adapter *adapter, u16 last_index)
 
 	atomic_sub(num_wrbs, &txq->used);
 
-	kfree_skb(sent_skb);
+	dev_kfree_skb_any(sent_skb);
 }
 
 static void be_rx_q_clean(struct be_adapter *adapter)
@@ -1216,20 +1216,66 @@  static int event_handle(struct be_ctrl_info *ctrl,
 
 	/* We can see an interrupt and no event */
 	be_eq_notify(ctrl, eq_obj->q.id, true, true, num);
-	if (num)
-		napi_schedule(&eq_obj->napi);
 
 	return num;
 }
 
+static int rx_event_handle(struct be_adapter *adapter)
+{
+	int evts;
+
+	evts = event_handle(&adapter->ctrl, &adapter->rx_eq);
+	if (evts)
+		napi_schedule(&adapter->rx_obj.napi);
+
+	return evts;
+}
+
+static int tx_event_handle(struct be_adapter *adapter)
+{
+	struct be_tx_obj *tx_obj = &adapter->tx_obj;
+	struct be_queue_info *tx_cq = &tx_obj->cq;
+	struct be_queue_info *txq = &tx_obj->q;
+	struct be_eth_tx_compl *txcp;
+	u32 num_cmpl = 0;
+	u16 end_idx;
+	int evts;
+
+	evts = event_handle(&adapter->ctrl, &adapter->tx_eq);
+	if (!evts)
+		goto done;
+
+	while ((txcp = be_tx_compl_get(adapter))) {
+		end_idx = AMAP_GET_BITS(struct amap_eth_tx_compl,
+					wrb_index, txcp);
+		be_tx_compl_process(adapter, end_idx);
+		num_cmpl++;
+	}
+
+	/* As Tx wrbs have been freed up, wake up netdev queue if
+	 * it was stopped due to lack of tx wrbs.
+	 */
+	if (netif_queue_stopped(adapter->netdev) &&
+			atomic_read(&txq->used) < txq->len / 2) {
+		netif_wake_queue(adapter->netdev);
+	}
+
+	be_cq_notify(&adapter->ctrl, tx_cq->id, true, num_cmpl);
+
+	drvr_stats(adapter)->be_tx_events++;
+	drvr_stats(adapter)->be_tx_compl += num_cmpl;
+
+done:
+	return evts;
+}
+
 static irqreturn_t be_intx(int irq, void *dev)
 {
 	struct be_adapter *adapter = dev;
-	struct be_ctrl_info *ctrl = &adapter->ctrl;
 	int rx, tx;
 
-	tx = event_handle(ctrl, &adapter->tx_eq);
-	rx = event_handle(ctrl, &adapter->rx_eq);
+	rx = rx_event_handle(adapter);
+	tx = tx_event_handle(adapter);
 
 	if (rx || tx)
 		return IRQ_HANDLED;
@@ -1241,7 +1287,7 @@  static irqreturn_t be_msix_rx(int irq, void *dev)
 {
 	struct be_adapter *adapter = dev;
 
-	event_handle(&adapter->ctrl, &adapter->rx_eq);
+	rx_event_handle(adapter);
 
 	return IRQ_HANDLED;
 }
@@ -1250,7 +1296,7 @@  static irqreturn_t be_msix_tx(int irq, void *dev)
 {
 	struct be_adapter *adapter = dev;
 
-	event_handle(&adapter->ctrl, &adapter->tx_eq);
+	tx_event_handle(adapter);
 
 	return IRQ_HANDLED;
 }
@@ -1270,9 +1316,9 @@  static inline bool do_lro(struct be_adapter *adapter,
 
 int be_poll_rx(struct napi_struct *napi, int budget)
 {
-	struct be_eq_obj *rx_eq = container_of(napi, struct be_eq_obj, napi);
+	struct be_rx_obj *rx_obj = container_of(napi, struct be_rx_obj, napi);
 	struct be_adapter *adapter =
-		container_of(rx_eq, struct be_adapter, rx_eq);
+		container_of(rx_obj, struct be_adapter, rx_obj);
 	struct be_queue_info *rx_cq = &adapter->rx_obj.cq;
 	struct be_eth_rx_compl *rxcp;
 	u32 work_done;
@@ -1305,44 +1351,6 @@  int be_poll_rx(struct napi_struct *napi, int budget)
 	return work_done;
 }
 
-/* For TX we don't honour budget; consume everything */
-int be_poll_tx(struct napi_struct *napi, int budget)
-{
-	struct be_eq_obj *tx_eq = container_of(napi, struct be_eq_obj, napi);
-	struct be_adapter *adapter =
-		container_of(tx_eq, struct be_adapter, tx_eq);
-	struct be_tx_obj *tx_obj = &adapter->tx_obj;
-	struct be_queue_info *tx_cq = &tx_obj->cq;
-	struct be_queue_info *txq = &tx_obj->q;
-	struct be_eth_tx_compl *txcp;
-	u32 num_cmpl = 0;
-	u16 end_idx;
-
-	while ((txcp = be_tx_compl_get(adapter))) {
-		end_idx = AMAP_GET_BITS(struct amap_eth_tx_compl,
-					wrb_index, txcp);
-		be_tx_compl_process(adapter, end_idx);
-		num_cmpl++;
-	}
-
-	/* As Tx wrbs have been freed up, wake up netdev queue if
-	 * it was stopped due to lack of tx wrbs.
-	 */
-	if (netif_queue_stopped(adapter->netdev) &&
-			atomic_read(&txq->used) < txq->len / 2) {
-		netif_wake_queue(adapter->netdev);
-	}
-
-	napi_complete(napi);
-
-	be_cq_notify(&adapter->ctrl, tx_cq->id, true, num_cmpl);
-
-	drvr_stats(adapter)->be_tx_events++;
-	drvr_stats(adapter)->be_tx_compl += num_cmpl;
-
-	return 1;
-}
-
 static void be_worker(struct work_struct *work)
 {
 	struct be_adapter *adapter =
@@ -1505,8 +1513,7 @@  static int be_open(struct net_device *netdev)
 	/* First time posting */
 	be_post_rx_frags(adapter);
 
-	napi_enable(&rx_eq->napi);
-	napi_enable(&tx_eq->napi);
+	napi_enable(&adapter->rx_obj.napi);
 
 	be_irq_register(adapter);
 
@@ -1559,8 +1566,7 @@  static int be_close(struct net_device *netdev)
 	}
 	be_irq_unregister(adapter);
 
-	napi_disable(&rx_eq->napi);
-	napi_disable(&tx_eq->napi);
+	napi_disable(&adapter->rx_obj.napi);
 
 	be_rx_queues_destroy(adapter);
 	be_tx_queues_destroy(adapter);
@@ -1652,9 +1658,7 @@  static void be_netdev_init(struct net_device *netdev)
 
 	be_lro_init(adapter, netdev);
 
-	netif_napi_add(netdev, &adapter->rx_eq.napi, be_poll_rx,
-		BE_NAPI_WEIGHT);
-	netif_napi_add(netdev, &adapter->tx_eq.napi, be_poll_tx,
+	netif_napi_add(netdev, &adapter->rx_obj.napi, be_poll_rx,
 		BE_NAPI_WEIGHT);
 
 	netif_carrier_off(netdev);