
[RFC] myri10ge: small rx_done refactoring

Message ID 20110323124939.GA7834@redhat.com
State RFC, archived
Delegated to: David Miller

Commit Message

Stanislaw Gruszka March 23, 2011, 12:52 p.m. UTC
myri10ge: small rx_done refactoring

Add an lro_enabled variable to read the NETIF_F_LRO flag only once per NAPI
poll call. This should fix a theoretical race condition with
myri10ge_set_rx_csum() and myri10ge_set_flags(), where the NETIF_F_LRO
flag can be changed.

Along the way, reduce the number of myri10ge_rx_done() arguments and call
sites by moving the mgp->small_bytes check into that function. That reduces
the code size

from:
   text	   data	    bss	    dec	    hex	filename
  36644	    248	    100	  36992	   9080	drivers/net/myri10ge/myri10ge.o

to:
   text	   data	    bss	    dec	    hex	filename
  36037	    247	    100	  36384	   8e20	drivers/net/myri10ge/myri10ge.o

on my i686 system, which should also make myri10ge_clean_rx_done()
faster.


Comments

Andrew Gallatin March 23, 2011, 3:21 p.m. UTC | #1
On 03/23/11 08:52, Stanislaw Gruszka wrote:

> on my i686 system, which should also make myri10ge_clean_rx_done()
> faster.

I tested this on my very old, very weak dual-core athlon64 systems.
These machines can barely achieve 10Gb/s using a 1500b MTU with LRO.
Running 35 60-second netperf tests into the machines with the stock
driver, and again with this patch applied, I see a tiny bandwidth
increase (1.4Mb/s on average), which is statistically significant
(p < 0.001). There is no statistically significant CPU load
reduction.

> +	if (len <= mgp->small_bytes) {
> +		rx = &ss->rx_small;
> +		bytes = mgp->small_bytes;
> +	} else {
> +		rx = &ss->rx_big,

Small nit:  the "," above should be a ";"

Between the small bandwidth increase, and the code size reduction,
I'm very appreciative of this patch.

Thank you,

Drew
stephen hemminger March 23, 2011, 3:33 p.m. UTC | #2
On Wed, 23 Mar 2011 13:52:04 +0100
Stanislaw Gruszka <sgruszka@redhat.com> wrote:

> Add an lro_enabled variable to read the NETIF_F_LRO flag only once per NAPI
> poll call. This should fix a theoretical race condition with
> myri10ge_set_rx_csum() and myri10ge_set_flags(), where the NETIF_F_LRO
> flag can be changed.

You may need a barrier or the race may still be there.
The driver seems to use mb() where wmb() is intended, and never use rmb()?
Stanislaw Gruszka March 24, 2011, 8:16 a.m. UTC | #3
On Wed, Mar 23, 2011 at 08:33:57AM -0700, Stephen Hemminger wrote:
> On Wed, 23 Mar 2011 13:52:04 +0100
> Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> 
> > Add an lro_enabled variable to read the NETIF_F_LRO flag only once per NAPI
> > poll call. This should fix a theoretical race condition with
> > myri10ge_set_rx_csum() and myri10ge_set_flags(), where the NETIF_F_LRO
> > flag can be changed.
> 
> You may need a barrier or the race may still be there.

I don't understand why a barrier is needed in that case.

What I tried to avoid is:

myri10ge_clean_rx_done():

if (dev->features & NETIF_F_LRO) 
	setup lro 
					myri10ge_set_flags()

if (dev->features & NETIF_F_LRO)
	flush lro

Now we read dev->features & NETIF_F_LRO only once into a local
lro_enabled variable, so we cannot flush without setup or
setup without flush. I have no idea why a memory barrier is
still needed.

> The driver seems to use mb() where wmb() is intended, and never use rmb()?

Yes, I think we can do some optimization here.

Stanislaw
stephen hemminger March 24, 2011, 3:15 p.m. UTC | #4
On Thu, 24 Mar 2011 09:16:21 +0100
Stanislaw Gruszka <sgruszka@redhat.com> wrote:

> On Wed, Mar 23, 2011 at 08:33:57AM -0700, Stephen Hemminger wrote:
> > On Wed, 23 Mar 2011 13:52:04 +0100
> > Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> > 
> > > Add an lro_enabled variable to read the NETIF_F_LRO flag only once per NAPI
> > > poll call. This should fix a theoretical race condition with
> > > myri10ge_set_rx_csum() and myri10ge_set_flags(), where the NETIF_F_LRO
> > > flag can be changed.
> > 
> > You may need a barrier or the race may still be there.
> 
> I don't understand why a barrier is needed in that case.
> 
> What I tried to avoid is:
> 
> myri10ge_clean_rx_done():
> 
> if (dev->features & NETIF_F_LRO) 
> 	setup lro 
> 					myri10ge_set_flags()
> 
> if (dev->features & NETIF_F_LRO)
> 	flush lro
> 
> Now we read dev->features & NETIF_F_LRO only once into a local
> lro_enabled variable, so we cannot flush without setup or
> setup without flush. I have no idea why a memory barrier is
> still needed.
> 
> > The driver seems to use mb() where wmb() is intended, and never use rmb()?
> 
> Yes, I think we can do some optimization here.
> 

Without a barrier there is no guarantee that the compiler reads the flags
into a local variable. It is free to do the same thing as the original
code.
Stanislaw Gruszka March 24, 2011, 3:59 p.m. UTC | #5
On Thu, Mar 24, 2011 at 08:15:39AM -0700, Stephen Hemminger wrote:
> > On Wed, Mar 23, 2011 at 08:33:57AM -0700, Stephen Hemminger wrote:
> > > On Wed, 23 Mar 2011 13:52:04 +0100
> > > Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> > > 
> > > > Add an lro_enabled variable to read the NETIF_F_LRO flag only once per NAPI
> > > > poll call. This should fix a theoretical race condition with
> > > > myri10ge_set_rx_csum() and myri10ge_set_flags(), where the NETIF_F_LRO
> > > > flag can be changed.
> > > 
> > > You may need a barrier or the race may still be there.
> > 
> > I don't understand why a barrier is needed in that case.
> > 
> > What I tried to avoid is:
> > 
> > myri10ge_clean_rx_done():
> > 
> > if (dev->features & NETIF_F_LRO) 
> > 	setup lro 
> > 					myri10ge_set_flags()
> > 
> > if (dev->features & NETIF_F_LRO)
> > 	flush lro
> > 
> > Now we read dev->features & NETIF_F_LRO only once into a local
> > lro_enabled variable, so we cannot flush without setup or
> > setup without flush. I have no idea why a memory barrier is
> > still needed.
> > 
> > > The driver seems to use mb() where wmb() is intended, and never use rmb()?
> > 
> > Yes, I think we can do some optimization here.
> > 
> 
> Without a barrier there is no guarantee that the compiler reads the flags
> into a local variable. It is free to do the same thing as the original
> code.

Ok, so C code like:

code1
if (dev->features & NETIF_F_LRO) 
	branch1
code2;
if (dev->features & NETIF_F_LRO)
	branch2

and

bool lro_enabled = dev->features & NETIF_F_LRO;
code1
if (lro_enabled)
	branch1
code2
if (lro_enabled)
	branch2

can give the same assembly output.

It's really hard for me to understand that. I could
understand it if we read a global variable directly,
like:
bool lro_enabled = dev->lro_enabled;
instead of:
bool lro_enabled = dev->features & NETIF_F_LRO;

David, can you confirm that Stephen is correct?

Also, where should this barrier() go: before
"bool lro_enabled = dev->features & NETIF_F_LRO;"
or after?

Stanislaw
Ben Hutchings March 24, 2011, 4:42 p.m. UTC | #6
On Thu, 2011-03-24 at 16:59 +0100, Stanislaw Gruszka wrote:
> On Thu, Mar 24, 2011 at 08:15:39AM -0700, Stephen Hemminger wrote:
> > > On Wed, Mar 23, 2011 at 08:33:57AM -0700, Stephen Hemminger wrote:
> > > > On Wed, 23 Mar 2011 13:52:04 +0100
> > > > Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> > > > 
> > > > > Add an lro_enabled variable to read the NETIF_F_LRO flag only once per NAPI
> > > > > poll call. This should fix a theoretical race condition with
> > > > > myri10ge_set_rx_csum() and myri10ge_set_flags(), where the NETIF_F_LRO
> > > > > flag can be changed.
> > > > 
> > > > You may need a barrier or the race may still be there.
> > > 
> > > I don't understand why a barrier is needed in that case.
> > > 
> > > What I tried to avoid is:
> > > 
> > > myri10ge_clean_rx_done():
> > > 
> > > if (dev->features & NETIF_F_LRO) 
> > > 	setup lro 
> > > 					myri10ge_set_flags()
> > > 
> > > if (dev->features & NETIF_F_LRO)
> > > 	flush lro
> > > 
> > > Now we read dev->features & NETIF_F_LRO only once into a local
> > > lro_enabled variable, so we cannot flush without setup or
> > > setup without flush. I have no idea why a memory barrier is
> > > still needed.
> > > 
> > > > The driver seems to use mb() where wmb() is intended, and never use rmb()?
> > > 
> > > Yes, I think we can do some optimization here.
> > > 
> > 
> > Without a barrier there is no guarantee that the compiler reads the flags
> > into a local variable. It is free to do the same thing as the original
> > code.
> 
> Ok, so C code like:
> 
> code1
> if (dev->features & NETIF_F_LRO) 
> 	branch1
> code2;
> if (dev->features & NETIF_F_LRO)
> 	branch2
> 
> and
> 
> bool lro_enabled = dev->features & NETIF_F_LRO;
> code1
> if (lro_enabled)
> 	branch1
> code2
> if (lro_enabled)
> 	branch2
> 
> can give the same assembly output.
[...]

Yes.  A C compiler is allowed to assume that data are not shared between
multiple threads, and apply any transformations that would not affect
the behaviour of a single-threaded program.

We can make use of some gcc extensions (wrapped up in macros like
barrier()) to inhibit some such transformations.  We also assume that
access to an int, long or pointer variable can be atomic.  The
ACCESS_ONCE() macro adds volatile-qualification to such memory access,
which inhibits duplication of the access.

So, if you only care that this function has a consistent value for
lro_enabled, you can read dev->features with ACCESS_ONCE():

	bool lro_enabled = ACCESS_ONCE(dev->features) & NETIF_F_LRO;
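
For reference, those two helpers boil down to a compiler memory clobber
and a volatile cast; roughly, from include/linux/compiler*.h:

	/* inhibit the compiler from caching memory values across this point */
	#define barrier() __asm__ __volatile__("" : : : "memory")

	/* volatile-qualified access: the value is read (or written) exactly once */
	#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))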

Ben.
David Howells March 24, 2011, 5:29 p.m. UTC | #7
Stanislaw Gruszka <sgruszka@redhat.com> wrote:

> David, can you confirm that Stephen is correct?

Stephen is correct.  The compiler is perfectly at liberty to merge the two
loads if the value being read is not marked volatile.

If you stick a barrier() in there between the reads, then I think the compiler
will be required to emit two load instructions.  However, the _CPU_ is then
entitled to merge them.

If you don't want the CPU to merge them, you have to use smp_rmb() or smp_mb()
between.

However, the compiler is also allowed to re-read the variable (ie. emit two
load instructions) if it would otherwise have to save the value on the stack
to free up a register, unless the pointed to value is marked volatile.

If you want to ensure that the value is read once only, then you need to use
ACCESS_ONCE() or stick a read/full barrier of some degree after the read.

Note the use of a barrier implies a partial ordering between two memory
accesses in the same instruction stream.  If you can't say which two memory
accesses you want to order, you probably shouldn't be using a barrier.
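
To make that concrete, here is a minimal userspace sketch (hypothetical
names, not the driver code) of the three patterns discussed above; the
kernel-style macros are inlined so it builds standalone:

	#include <stdio.h>

	#define barrier() __asm__ __volatile__("" : : : "memory")
	#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

	#define FAKE_LRO 0x8000UL            /* stand-in for NETIF_F_LRO */
	unsigned long features = FAKE_LRO;   /* stand-in for dev->features */

	void setup_lro(void) { puts("setup lro"); }
	void flush_lro(void) { puts("flush lro"); }

	/* Plain reads: the compiler may merge the two tests into a single
	 * load, or re-read the global between them. */
	void plain(void)
	{
		if (features & FAKE_LRO)
			setup_lro();
		if (features & FAKE_LRO)
			flush_lro();
	}

	/* barrier() makes the compiler reload after the clobber, but the
	 * CPU is still free to combine the loads. */
	void with_barrier(void)
	{
		if (features & FAKE_LRO)
			setup_lro();
		barrier();
		if (features & FAKE_LRO)
			flush_lro();
	}

	/* ACCESS_ONCE() reads the flags exactly once, so setup and flush
	 * always see the same snapshot. */
	void read_once(void)
	{
		unsigned long lro = ACCESS_ONCE(features) & FAKE_LRO;

		if (lro)
			setup_lro();
		if (lro)
			flush_lro();
	}

	int main(void)
	{
		plain();
		with_barrier();
		read_once();
		return 0;
	}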

David

Patch

diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index 24386a8..2e71240 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -1312,17 +1312,26 @@  myri10ge_unmap_rx_page(struct pci_dev *pdev,
 				 * page into an skb */
 
 static inline int
-myri10ge_rx_done(struct myri10ge_slice_state *ss, struct myri10ge_rx_buf *rx,
-		 int bytes, int len, __wsum csum)
+myri10ge_rx_done(struct myri10ge_slice_state *ss, int len, __wsum csum,
+		 bool lro_enabled)
 {
 	struct myri10ge_priv *mgp = ss->mgp;
 	struct sk_buff *skb;
 	struct skb_frag_struct rx_frags[MYRI10GE_MAX_FRAGS_PER_FRAME];
-	int i, idx, hlen, remainder;
+	struct myri10ge_rx_buf *rx;
+	int i, idx, hlen, remainder, bytes;
 	struct pci_dev *pdev = mgp->pdev;
 	struct net_device *dev = mgp->dev;
 	u8 *va;
 
+	if (len <= mgp->small_bytes) {
+		rx = &ss->rx_small;
+		bytes = mgp->small_bytes;
+	} else {
+		rx = &ss->rx_big,
+		bytes = mgp->big_bytes;
+	}
+
 	len += MXGEFW_PAD;
 	idx = rx->cnt & rx->mask;
 	va = page_address(rx->info[idx].page) + rx->info[idx].page_offset;
@@ -1341,7 +1350,7 @@  myri10ge_rx_done(struct myri10ge_slice_state *ss, struct myri10ge_rx_buf *rx,
 		remainder -= MYRI10GE_ALLOC_SIZE;
 	}
 
-	if (dev->features & NETIF_F_LRO) {
+	if (lro_enabled) {
 		rx_frags[0].page_offset += MXGEFW_PAD;
 		rx_frags[0].size -= MXGEFW_PAD;
 		len -= MXGEFW_PAD;
@@ -1464,6 +1473,7 @@  myri10ge_clean_rx_done(struct myri10ge_slice_state *ss, int budget)
 	struct myri10ge_rx_done *rx_done = &ss->rx_done;
 	struct myri10ge_priv *mgp = ss->mgp;
 	struct net_device *netdev = mgp->dev;
+	bool lro_enabled = (netdev->features & NETIF_F_LRO) ? true : false;
 	unsigned long rx_bytes = 0;
 	unsigned long rx_packets = 0;
 	unsigned long rx_ok;
@@ -1478,14 +1488,7 @@  myri10ge_clean_rx_done(struct myri10ge_slice_state *ss, int budget)
 		length = ntohs(rx_done->entry[idx].length);
 		rx_done->entry[idx].length = 0;
 		checksum = csum_unfold(rx_done->entry[idx].checksum);
-		if (length <= mgp->small_bytes)
-			rx_ok = myri10ge_rx_done(ss, &ss->rx_small,
-						 mgp->small_bytes,
-						 length, checksum);
-		else
-			rx_ok = myri10ge_rx_done(ss, &ss->rx_big,
-						 mgp->big_bytes,
-						 length, checksum);
+		rx_ok = myri10ge_rx_done(ss, length, checksum, lro_enabled);
 		rx_packets += rx_ok;
 		rx_bytes += rx_ok * (unsigned long)length;
 		cnt++;
@@ -1497,7 +1500,7 @@  myri10ge_clean_rx_done(struct myri10ge_slice_state *ss, int budget)
 	ss->stats.rx_packets += rx_packets;
 	ss->stats.rx_bytes += rx_bytes;
 
-	if (netdev->features & NETIF_F_LRO)
+	if (lro_enabled)
 		lro_flush_all(&rx_done->lro_mgr);
 
 	/* restock receive rings if needed */