Patchwork tilegx: fix some issues in the SW TSO support

login
register
mail settings
Submitter Chris Metcalf
Date Oct. 25, 2012, 5:25 p.m.
Message ID <201210251737.q9PHbgev032681@farm-0012.internal.tilera.com>
Download mbox | patch
Permalink /patch/194261/
State Accepted
Delegated to: David Miller
Headers show

Comments

Chris Metcalf - Oct. 25, 2012, 5:25 p.m.
This change correctly computes the header length and data length in
the fragments to avoid a bug where we would end up with extremely
slow performance.  Also adopt use of skb_frag_size() accessor.

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Cc: stable@vger.kernel.org [v3.6]
---
 drivers/net/ethernet/tile/tilegx.c |   35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)
Ben Hutchings - Oct. 25, 2012, 5:51 p.m.
On Thu, 2012-10-25 at 13:25 -0400, Chris Metcalf wrote:
> This change correctly computes the header length and data length in
> the fragments to avoid a bug where we would end up with extremely
> slow performance.  Also adopt use of skb_frag_size() accessor.
[...]

By the way, since you're doing soft-TSO you should probably set
net_device::gso_max_segs, as explained in:

commit 30b678d844af3305cda5953467005cebb5d7b687
Author: Ben Hutchings <bhutchings@solarflare.com>
Date:   Mon Jul 30 15:57:00 2012 +0000

    net: Allow driver to limit number of GSO segments per skb

Ben.
Chris Metcalf - Oct. 25, 2012, 6:16 p.m.
On 10/25/2012 1:51 PM, Ben Hutchings wrote:
> On Thu, 2012-10-25 at 13:25 -0400, Chris Metcalf wrote:
>> This change correctly computes the header length and data length in
>> the fragments to avoid a bug where we would end up with extremely
>> slow performance.  Also adopt use of skb_frag_size() accessor.
> [...]
>
> By the way, since you're doing soft-TSO you should probably set
> net_device::gso_max_segs, as explained in:
>
> commit 30b678d844af3305cda5953467005cebb5d7b687
> Author: Ben Hutchings <bhutchings@solarflare.com>
> Date:   Mon Jul 30 15:57:00 2012 +0000
>
>     net: Allow driver to limit number of GSO segments per skb

We currently have a hard limit of 2048 equeue entries (effectively,
segments) per interface.  The commit message suggests 861 is the largest
number we're likely to see, so I think we're OK from a correctness point of
view.  But, perhaps, we could end up with multiple cores trying to push
separate flows each with this tiny MSS issue, and they would then be
contending for the 2048 equeue entries, and performance might suffer.  I
don't have a good instinct on what value we should choose to set here; I
see that sfc uses 100.

So, we could do nothing since it seems we're technically safe; we could say
2048 to be explicit; we could pick a random fraction of the full size to
help avoid contention effects, like 1024 or 512; or we could mimic sfc and
just say 100.  What do you think?
Ben Hutchings - Oct. 25, 2012, 7:36 p.m.
On Thu, 2012-10-25 at 14:16 -0400, Chris Metcalf wrote:
> On 10/25/2012 1:51 PM, Ben Hutchings wrote:
> > On Thu, 2012-10-25 at 13:25 -0400, Chris Metcalf wrote:
> >> This change correctly computes the header length and data length in
> >> the fragments to avoid a bug where we would end up with extremely
> >> slow performance.  Also adopt use of skb_frag_size() accessor.
> > [...]
> >
> > By the way, since you're doing soft-TSO you should probably set
> > net_device::gso_max_segs, as explained in:
> >
> > commit 30b678d844af3305cda5953467005cebb5d7b687
> > Author: Ben Hutchings <bhutchings@solarflare.com>
> > Date:   Mon Jul 30 15:57:00 2012 +0000
> >
> >     net: Allow driver to limit number of GSO segments per skb
> 
> We currently have a hard limit of 2048 equeue entries (effectively,
> segments) per interface.  The commit message suggests 861 is the largest
> number we're likely to see, so I think we're OK from a correctness point of
> view.
>
> But, perhaps, we could end up with multiple cores trying to push
> separate flows each with this tiny MSS issue, and they would then be
> contending for the 2048 equeue entries, and performance might suffer.  I
> don't have a good instinct on what value we should choose to set here; I
> see that sfc uses 100.
>
> So, we could do nothing since it seems we're technically safe; we could say
> 2048 to be explicit; we could pick a random fraction of the full size to
> help avoid contention effects, like 1024 or 512; or we could mimic sfc and
> just say 100.  What do you think?

You need at least 2 descriptors per segment, plus 1 for each fragment
boundary, so the maximum would be something like
(2048 - MAX_SKB_FRAGS) / 2 ~= 1015.  So the worst-case skb can fit in
the queue, but it requires so much of the queue space that the queue
might not become sufficiently empty before the watchdog timer fires.

There's a simple way to test this - configure the 'attacker' host and
interface like this:

# ip link set $IFACE mtu 128
# sysctl net.ipv4.route.min_adv_mss=88

and then use something like netperf to generate a high bandwidth stream
of TCP traffic toward it from the tilegx interface.

Ben.
Ben Hutchings - Oct. 25, 2012, 7:48 p.m.
On Thu, 2012-10-25 at 20:36 +0100, Ben Hutchings wrote:
[...]
> There's a simple way to test this - configure the 'attacker' host and
> interface like this:
> 
> # ip link set $IFACE mtu 128
> # sysctl net.ipv4.route.min_adv_mss=88
> 
> and then use something like netperf to generate a high bandwidth stream
> of TCP traffic toward it from the tilegx interface.

Also you would need to run some other traffic in parallel (multiple
streams to the 'attacker' and some more normal streams) to ensure there
is real contention.

Ben.
Chris Metcalf - Oct. 26, 2012, 6:17 p.m.
On 10/25/2012 3:48 PM, Ben Hutchings wrote:
> On Thu, 2012-10-25 at 20:36 +0100, Ben Hutchings wrote:
> [...]
>> There's a simple way to test this - configure the 'attacker' host and
>> interface like this:
>>
>> # ip link set $IFACE mtu 128
>> # sysctl net.ipv4.route.min_adv_mss=88
>>
>> and then use something like netperf to generate a high bandwidth stream
>> of TCP traffic toward it from the tilegx interface.
> Also you would need to run some other traffic in parallel (multiple
> streams to the 'attacker' and some more normal streams) to ensure there
> is real contention.

Thanks!  I've filed a bugzilla within Tilera to come back and revisit this
- I think it's not a short-term problem so we'll find some time to look at
it in the future.  Or, someone reading this could try it themselves and
submit the patch :-)

Patch

diff --git a/drivers/net/ethernet/tile/tilegx.c b/drivers/net/ethernet/tile/tilegx.c
index 4e2a162..4e98100 100644
--- a/drivers/net/ethernet/tile/tilegx.c
+++ b/drivers/net/ethernet/tile/tilegx.c
@@ -1334,11 +1334,11 @@  static int tso_count_edescs(struct sk_buff *skb)
 {
 	struct skb_shared_info *sh = skb_shinfo(skb);
 	unsigned int sh_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
-	unsigned int data_len = skb->data_len + skb->hdr_len - sh_len;
+	unsigned int data_len = skb->len - sh_len;
 	unsigned int p_len = sh->gso_size;
 	long f_id = -1;    /* id of the current fragment */
-	long f_size = skb->hdr_len;  /* size of the current fragment */
-	long f_used = sh_len;  /* bytes used from the current fragment */
+	long f_size = skb_headlen(skb) - sh_len;  /* current fragment size */
+	long f_used = 0;  /* bytes used from the current fragment */
 	long n;            /* size of the current piece of payload */
 	int num_edescs = 0;
 	int segment;
@@ -1353,7 +1353,7 @@  static int tso_count_edescs(struct sk_buff *skb)
 			/* Advance as needed. */
 			while (f_used >= f_size) {
 				f_id++;
-				f_size = sh->frags[f_id].size;
+				f_size = skb_frag_size(&sh->frags[f_id]);
 				f_used = 0;
 			}
 
@@ -1384,13 +1384,13 @@  static void tso_headers_prepare(struct sk_buff *skb, unsigned char *headers,
 	struct iphdr *ih;
 	struct tcphdr *th;
 	unsigned int sh_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
-	unsigned int data_len = skb->data_len + skb->hdr_len - sh_len;
+	unsigned int data_len = skb->len - sh_len;
 	unsigned char *data = skb->data;
 	unsigned int ih_off, th_off, p_len;
 	unsigned int isum_seed, tsum_seed, id, seq;
 	long f_id = -1;    /* id of the current fragment */
-	long f_size = skb->hdr_len;  /* size of the current fragment */
-	long f_used = sh_len;  /* bytes used from the current fragment */
+	long f_size = skb_headlen(skb) - sh_len;  /* current fragment size */
+	long f_used = 0;  /* bytes used from the current fragment */
 	long n;            /* size of the current piece of payload */
 	int segment;
 
@@ -1405,7 +1405,7 @@  static void tso_headers_prepare(struct sk_buff *skb, unsigned char *headers,
 	isum_seed = ((0xFFFF - ih->check) +
 		     (0xFFFF - ih->tot_len) +
 		     (0xFFFF - ih->id));
-	tsum_seed = th->check + (0xFFFF ^ htons(sh_len + data_len));
+	tsum_seed = th->check + (0xFFFF ^ htons(skb->len));
 	id = ntohs(ih->id);
 	seq = ntohl(th->seq);
 
@@ -1444,7 +1444,7 @@  static void tso_headers_prepare(struct sk_buff *skb, unsigned char *headers,
 			/* Advance as needed. */
 			while (f_used >= f_size) {
 				f_id++;
-				f_size = sh->frags[f_id].size;
+				f_size = skb_frag_size(&sh->frags[f_id]);
 				f_used = 0;
 			}
 
@@ -1478,14 +1478,14 @@  static void tso_egress(struct net_device *dev, gxio_mpipe_equeue_t *equeue,
 	struct tile_net_priv *priv = netdev_priv(dev);
 	struct skb_shared_info *sh = skb_shinfo(skb);
 	unsigned int sh_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
-	unsigned int data_len = skb->data_len + skb->hdr_len - sh_len;
+	unsigned int data_len = skb->len - sh_len;
 	unsigned int p_len = sh->gso_size;
 	gxio_mpipe_edesc_t edesc_head = { { 0 } };
 	gxio_mpipe_edesc_t edesc_body = { { 0 } };
 	long f_id = -1;    /* id of the current fragment */
-	long f_size = skb->hdr_len;  /* size of the current fragment */
-	long f_used = sh_len;  /* bytes used from the current fragment */
-	void *f_data = skb->data;
+	long f_size = skb_headlen(skb) - sh_len;  /* current fragment size */
+	long f_used = 0;  /* bytes used from the current fragment */
+	void *f_data = skb->data + sh_len;
 	long n;            /* size of the current piece of payload */
 	unsigned long tx_packets = 0, tx_bytes = 0;
 	unsigned int csum_start;
@@ -1516,15 +1516,18 @@  static void tso_egress(struct net_device *dev, gxio_mpipe_equeue_t *equeue,
 
 		/* Egress the payload. */
 		while (p_used < p_len) {
+			void *va;
 
 			/* Advance as needed. */
 			while (f_used >= f_size) {
 				f_id++;
-				f_size = sh->frags[f_id].size;
-				f_used = 0;
+				f_size = skb_frag_size(&sh->frags[f_id]);
 				f_data = tile_net_frag_buf(&sh->frags[f_id]);
+				f_used = 0;
 			}
 
+			va = f_data + f_used;
+
 			/* Use bytes from the current fragment. */
 			n = p_len - p_used;
 			if (n > f_size - f_used)
@@ -1533,7 +1536,7 @@  static void tso_egress(struct net_device *dev, gxio_mpipe_equeue_t *equeue,
 			p_used += n;
 
 			/* Egress a piece of the payload. */
-			edesc_body.va = va_to_tile_io_addr(f_data) + f_used;
+			edesc_body.va = va_to_tile_io_addr(va);
 			edesc_body.xfer_size = n;
 			edesc_body.bound = !(p_used < p_len);
 			gxio_mpipe_equeue_put_at(equeue, edesc_body, slot);