
[net-next,09/14] tg3: Improve small packet performance

Message ID 1280784368-4226-9-git-send-email-mcarlson@broadcom.com
State Accepted, archived
Delegated to: David Miller

Commit Message

Matt Carlson Aug. 2, 2010, 9:26 p.m. UTC
The smp_mb() inside tg3_tx_avail() is executed twice in the normal
tg3_start_xmit() path (see the illustration below), but the full memory
barrier is only necessary to handle the race with tx completion.  We
can therefore speed up the tx path by replacing the smp_mb() in
tg3_tx_avail() with a compiler barrier, which only forces the compiler
to fetch tx_prod and tx_cons from memory.

In the race condition between tg3_start_xmit() and tg3_tx(),
we have the following situation:

tg3_start_xmit()                       tg3_tx()
    if (!tg3_tx_avail())
        BUG();

    ...

    if (!tg3_tx_avail())
        netif_tx_stop_queue();         update_tx_index();
        smp_mb();                      smp_mb();
        if (tg3_tx_avail())            if (netif_tx_queue_stopped() &&
            netif_tx_wake_queue();         tg3_tx_avail())

With smp_mb() removed from tg3_tx_avail(), we need to add smp_mb() to
tg3_start_xmit() as shown above, so that netif_tx_stop_queue() is
ordered before the tg3_tx_avail() check of the ring indices.  Without
this strict ordering, the tx queue can be stopped forever.
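
For reference, the completion-side half of this pairing in tg3_tx() looks
roughly like the sketch below (illustrative, not a verbatim quote of the
driver); it is what the smp_mb() added to tg3_start_xmit() pairs with:

    /* tg3_tx()-style completion path (sketch): publish the new consumer
     * index, then order that store against the netif_tx_queue_stopped()
     * check so a stopped queue is not missed.
     */
    tnapi->tx_cons = sw_idx;
    smp_mb();               /* pairs with smp_mb() in tg3_start_xmit() */

    if (unlikely(netif_tx_queue_stopped(txq) &&
                 tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi))) {
            __netif_tx_lock(txq, smp_processor_id());
            if (netif_tx_queue_stopped(txq) &&
                tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi))
                    netif_tx_wake_queue(txq);
            __netif_tx_unlock(txq);
    }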

This improves performance by about 3% with 2 ports running
bi-directional 64-byte packets.

Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
---
 drivers/net/tg3.c |   24 +++++++++++++++++++++++-
 1 files changed, 23 insertions(+), 1 deletions(-)

Comments

Anton Blanchard Aug. 4, 2010, 10:27 p.m. UTC | #1
Hi,

Just saw this go in:
  
>  static inline u32 tg3_tx_avail(struct tg3_napi *tnapi)
>  {
> -	smp_mb();
> +	/* Tell compiler to fetch tx indices from memory. */
> +	barrier();
>  	return tnapi->tx_pending -
>  	       ((tnapi->tx_prod - tnapi->tx_cons) & (TG3_TX_RING_SIZE - 1));
>  }

Which worries me. Are we sure we don't need any ordering (eg smp_rmb)?
A compiler barrier does nothing to ensure two loads are ordered.

Anton
Matt Carlson Aug. 4, 2010, 10:46 p.m. UTC | #2
On Wed, Aug 04, 2010 at 03:27:41PM -0700, Anton Blanchard wrote:
> 
> Hi,
> 
> Just saw this go in:
>   
> >  static inline u32 tg3_tx_avail(struct tg3_napi *tnapi)
> >  {
> > -	smp_mb();
> > +	/* Tell compiler to fetch tx indices from memory. */
> > +	barrier();
> >  	return tnapi->tx_pending -
> >  	       ((tnapi->tx_prod - tnapi->tx_cons) & (TG3_TX_RING_SIZE - 1));
> >  }
> 
> Which worries me. Are we sure we don't need any ordering (eg smp_rmb)?
> A compiler barrier does nothing to ensure two loads are ordered.
> 
> Anton

The compiler barrier just keeps the compiler from reordering or reusing
stale loads at the call site, so tx_prod and tx_cons are actually
re-read from memory.  In the places where the ordering really matters,
we have the memory barrier inlined into the calling code.  In the rest
of the places, we can relax the ordering requirement because the value
is only used as an estimate.
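
As a rough illustration of what barrier() does and does not provide (a
standalone sketch, not the driver's code; the struct, ring_avail() and
RING_SIZE names are made up):

    struct ring {
            unsigned int prod;      /* advanced by the xmit path */
            unsigned int cons;      /* advanced by the completion path */
    };

    static inline unsigned int ring_avail(struct ring *r)
    {
            /* barrier() only stops the compiler from reusing stale,
             * previously loaded prod/cons values; it does not order
             * these loads against other CPUs the way smp_rmb() or
             * smp_mb() would.
             */
            barrier();
            return RING_SIZE - ((r->prod - r->cons) & (RING_SIZE - 1));
    }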

Michael Chan Aug. 4, 2010, 10:47 p.m. UTC | #3
On Wed, 2010-08-04 at 15:27 -0700, Anton Blanchard wrote:
> Hi,
> 
> Just saw this go in:
>   
> >  static inline u32 tg3_tx_avail(struct tg3_napi *tnapi)
> >  {
> > -	smp_mb();
> > +	/* Tell compiler to fetch tx indices from memory. */
> > +	barrier();
> >  	return tnapi->tx_pending -
> >  	       ((tnapi->tx_prod - tnapi->tx_cons) & (TG3_TX_RING_SIZE - 1));
> >  }
> 
> Which worries me. Are we sure we don't need any ordering (eg smp_rmb)?
> A compiler barrier does nothing to ensure two loads are ordered.

We generally only get an estimate of the available tx ring size when we
call tg3_tx_avail(), so memory barriers are not generally needed.  We
put a compiler barrier there to make sure that the compiler will fetch
the tx_prod and tx_cons from memory to give us a better estimate.

In the specific cases detailed in the patch description, we do need
memory barriers when we call netif_tx_stop_queue() and then re-check
the tx ring.  We decided to put the memory barriers exactly where
they're needed instead of inside tg3_tx_avail(), which would be
overkill.
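
To make the split concrete, a condensed sketch of the xmit-path call
sites (paraphrasing the patch below, not quoting it):

    /* Estimate-only site: a stale value at worst stops the queue a
     * little early, so the barrier() inside tg3_tx_avail() is enough.
     */
    if (unlikely(tg3_tx_avail(tnapi) <= (MAX_SKB_FRAGS + 1))) {
            netif_tx_stop_queue(txq);

            /* Ordered site: this re-check must observe tg3_tx()'s
             * index update, so the caller supplies the full smp_mb().
             */
            smp_mb();
            if (tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi))
                    netif_tx_wake_queue(txq);
    }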

Thanks.


Anton Blanchard Aug. 4, 2010, 11:08 p.m. UTC | #4
Hi,

> We generally only get an estimate of the available tx ring size when we
> call tg3_tx_avail(), so memory barriers are not generally needed.  We
> put a compiler barrier there to make sure that the compiler will fetch
> the tx_prod and tx_cons from memory to give us a better estimate.
> 
> In the specific cases detailed in the patch description, we do need
> memory barriers when we call netif_tx_stop_queue() and then re-check
> the tx ring.  We decided to put the memory barriers exactly where
> they're needed instead of inside tg3_tx_avail(), which would be
> overkill.

Thanks Matt and Michael. I was pretty sure you were thinking about
ordering issues but wanted to double check.

Anton

Patch

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 32e3a3d..820a7dd 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -4386,7 +4386,8 @@  static void tg3_tx_recover(struct tg3 *tp)
 
 static inline u32 tg3_tx_avail(struct tg3_napi *tnapi)
 {
-	smp_mb();
+	/* Tell compiler to fetch tx indices from memory. */
+	barrier();
 	return tnapi->tx_pending -
 	       ((tnapi->tx_prod - tnapi->tx_cons) & (TG3_TX_RING_SIZE - 1));
 }
@@ -5670,6 +5671,13 @@  static netdev_tx_t tg3_start_xmit(struct sk_buff *skb,
 	tnapi->tx_prod = entry;
 	if (unlikely(tg3_tx_avail(tnapi) <= (MAX_SKB_FRAGS + 1))) {
 		netif_tx_stop_queue(txq);
+
+		/* netif_tx_stop_queue() must be done before checking
+		 * the tx index in tg3_tx_avail() below, because in
+		 * tg3_tx(), we update tx index before checking for
+		 * netif_tx_queue_stopped().
+		 */
+		smp_mb();
 		if (tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi))
 			netif_tx_wake_queue(txq);
 	}
@@ -5715,6 +5723,13 @@  static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
 	/* Estimate the number of fragments in the worst case */
 	if (unlikely(tg3_tx_avail(&tp->napi[0]) <= frag_cnt_est)) {
 		netif_stop_queue(tp->dev);
+
+		/* netif_tx_stop_queue() must be done before checking
+		 * the tx index in tg3_tx_avail() below, because in
+		 * tg3_tx(), we update tx index before checking for
+		 * netif_tx_queue_stopped().
+		 */
+		smp_mb();
 		if (tg3_tx_avail(&tp->napi[0]) <= frag_cnt_est)
 			return NETDEV_TX_BUSY;
 
@@ -5950,6 +5965,13 @@  static netdev_tx_t tg3_start_xmit_dma_bug(struct sk_buff *skb,
 	tnapi->tx_prod = entry;
 	if (unlikely(tg3_tx_avail(tnapi) <= (MAX_SKB_FRAGS + 1))) {
 		netif_tx_stop_queue(txq);
+
+		/* netif_tx_stop_queue() must be done before checking
+		 * the tx index in tg3_tx_avail() below, because in
+		 * tg3_tx(), we update tx index before checking for
+		 * netif_tx_queue_stopped().
+		 */
+		smp_mb();
 		if (tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi))
 			netif_tx_wake_queue(txq);
 	}