Patchwork: Slow speed of TCP connections in a network namespace

Submitter Eric Dumazet
Date Dec. 29, 2012, 6:58 p.m.
Message ID <1356807516.4102.4.camel@edumazet-laptop>
Permalink /patch/208681/
State RFC
Delegated to: David Miller

Comments

Eric Dumazet - Dec. 29, 2012, 6:58 p.m.
On Saturday, Dec. 29, 2012 at 09:40 -0800, Eric Dumazet wrote:

> 
> Please post your new tcpdump then ;)
> 
> also post "netstat -s" from root and test ns after your wgets

Also try the following bnx2 patch.

It should help GRO / TCP coalescing.

bnx2 should be the last driver not using skb head_frag.




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Dec. 29, 2012, 7:41 p.m.
On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> On Saturday, Dec. 29, 2012 at 09:40 -0800, Eric Dumazet wrote:
> 
> > 
> > Please post your new tcpdump then ;)
> > 
> > also post "netstat -s" from root and test ns after your wgets
> 
> Also try following bnx2 patch.
> 
> It should help GRO / TCP coalesce
> 
> bnx2 should be the last driver not using skb head_frag

And of course, you should make sure all your bnx2 interrupts are handled
by the same cpu.

Otherwise, packets might be reordered because of the way dev_forward_skb()
works.

(CPU X gets a bunch of packets from eth0, forwards them via netif_rx() into
the local CPU X queue, and NAPI is ended on eth0.)

CPU Y gets a bunch of packets from eth0, forwards them via netif_rx() into
the local CPU Y queue.

CPU X and CPU Y process their local queues in parallel -> packets are
delivered out of order to the TCP stack.

An alternative is to set up RPS on your veth1 device, to force packets to be
delivered/handled by a given cpu.
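The RPS knob lives in sysfs; here is a minimal sketch of that setup, assuming the device is named veth1, has a single RX queue, and CPU 0 is the target:

```shell
# Steer veth1 receive processing to CPU 0 (hex bitmask 0x1).
# Path assumes a single RX queue (rx-0); repeat for each rx-N if more exist.
echo 1 > /sys/class/net/veth1/queues/rx-0/rps_cpus

# Verify the mask took effect.
cat /sys/class/net/veth1/queues/rx-0/rps_cpus
```

With a non-zero mask, packets received on veth1 are placed on the backlog of a CPU chosen from the mask, so a single-CPU mask serializes delivery and avoids the reordering described above.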





Andrew Vagin - Dec. 29, 2012, 8:08 p.m.
On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote:
> On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> > On Saturday, Dec. 29, 2012 at 09:40 -0800, Eric Dumazet wrote:
> > 
> > > 
> > > Please post your new tcpdump then ;)
> > > 
> > > also post "netstat -s" from root and test ns after your wgets
> > 
> > Also try following bnx2 patch.
> > 
> > It should help GRO / TCP coalesce
> > 
> > bnx2 should be the last driver not using skb head_frag

I don't have access to the host. I'm going to test your patch tomorrow.
Thanks.

> 
> And of course, you should make sure all your bnx2 interrupts are handled
> by the same cpu.
All bnx2 interrupts are handled on all cpus. They are handled on the same
cpu if the kernel is booted with msi_disable=1.

Is it right that the receive window will be smaller if packets are not sorted?
Looks like a bug.

I mean that it probably works correctly if packets are sorted. But I think
that if packets are not sorted, it should still work at the same speed;
CPU load and memory consumption may be a bit higher.

> 
> Or else, packets might be reordered because the way dev_forward_skb()
> works.
> 
> (CPU X gets a bunch of packets from eth0, forward them via netif_rx() in
> the local CPU X queue, NAPI is ended on eth0)
> 
> CPU Y gets a bunch of packets from eth0, forward them via netif_rx() in
> the local CPU Y queue.
> 
> CPU X and Y process their local queue in // -> packets are delivered Out
> of order to TCP stack
> 
> Alternative is to setup RPS on your veth1 device, to force packets being
> delivered/handled by a given cpu
> 
> 
> 
> 
> 
Eric Dumazet - Dec. 29, 2012, 8:20 p.m.
On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote:
> > On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> > > On Saturday, Dec. 29, 2012 at 09:40 -0800, Eric Dumazet wrote:
> > > 
> > > > 
> > > > Please post your new tcpdump then ;)
> > > > 
> > > > also post "netstat -s" from root and test ns after your wgets
> > > 
> > > Also try following bnx2 patch.
> > > 
> > > It should help GRO / TCP coalesce
> > > 
> > > bnx2 should be the last driver not using skb head_frag
> 
> I don't have access to the host. I'm going to test your patch tomorrow.
> Thanks.
> 
> > 
> > And of course, you should make sure all your bnx2 interrupts are handled
> > by the same cpu.
> All bnx interrupts are handled on all cpus. They are handled on the same
> cpu, if a kernel is booted with msi_disable=1.
> 
> Is it right, that a received window will be less, if packets are not sorted?
> Looks like a bug.
> 
> I want to say, that probably it works correctly, if packets are sorted.
> But I think if packets are not sorted, it should work with the same
> speed, cpu load and memory consumption may be a bit more.

Without veth, it doesn't really matter that IRQs are spread over multiple
cpus, because packets are handled in NAPI, and only one cpu runs the
eth0 NAPI handler at a time.

But as soon as packets are queued (by netif_rx()) for 'later'
processing, you can see a dramatic performance decrease.

That's why you really should make sure the IRQs on your eth0 device
are handled by a single cpu.

It will help to get better performance in most cases.

echo 1 >/proc/irq/*/eth0/../smp_affinity

If it doesn't work, you might try instead:

echo 1 >/proc/irq/default_smp_affinity
<you might need to reload bnx2 module, or ifdown/ifup eth0 >
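A sketch of the same pinning done per-vector (the grep pattern and the mask are assumptions; MSI-X vector names vary by driver):

```shell
# Pin every IRQ whose /proc/interrupts line mentions eth0 to CPU 0 (mask 0x1).
# Word-splitting of the command substitution strips the padding around the
# IRQ number, so $irq ends up as a bare number like "68".
for irq in $(grep eth0 /proc/interrupts | sed 's/:.*//'); do
    echo 1 > /proc/irq/$irq/smp_affinity
done

# Confirm: the counters should now only grow in the CPU0 column.
grep eth0 /proc/interrupts
```

Note that smp_affinity takes a hexadecimal CPU bitmask, so `echo 2` would select CPU 1 and `echo 3` CPUs 0 and 1.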



Andrew Vagin - Dec. 29, 2012, 9:07 p.m.
On Sat, Dec 29, 2012 at 12:20:07PM -0800, Eric Dumazet wrote:
> On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> > On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote:
> > > On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> > > > On Saturday, Dec. 29, 2012 at 09:40 -0800, Eric Dumazet wrote:
> > > > 
> > > > > 
> > > > > Please post your new tcpdump then ;)
> > > > > 
> > > > > also post "netstat -s" from root and test ns after your wgets
> > > > 
> > > > Also try following bnx2 patch.
> > > > 
> > > > It should help GRO / TCP coalesce
> > > > 
> > > > bnx2 should be the last driver not using skb head_frag
> > 
> > I don't have access to the host. I'm going to test your patch tomorrow.
> > Thanks.
> > 
> > > 
> > > And of course, you should make sure all your bnx2 interrupts are handled
> > > by the same cpu.
> > All bnx interrupts are handled on all cpus. They are handled on the same
> > cpu, if a kernel is booted with msi_disable=1.
> > 
> > Is it right, that a received window will be less, if packets are not sorted?
> > Looks like a bug.
> > 
> > I want to say, that probably it works correctly, if packets are sorted.
> > But I think if packets are not sorted, it should work with the same
> > speed, cpu load and memory consumption may be a bit more.
> 
> Without veth, it doesnt really matter that IRQ are spread on multiple
> cpus, because packets are handled in NAPI, and only one cpu runs the
> eth0 NAPI handler at one time.
> 
> But as soon as packets are queued (by netif_rx()) for 'later'
> processing, you can have dramatic performance decrease.
> 
> Thats why you really should make sure IRQ on your eth0 device
> are handled by a single cpu.
> 
> It will help to get better performance in most cases.

I understand this fact, but such a big difference looks strange to me.

Default configuration (with the bug):
# cat /proc/interrupts  | grep eth0
  68:      10187      10188      10187      10023      10190      10185
10187      10019   PCI-MSI-edge      eth0

> 
> echo 1 >/proc/irq/*/eth0/../smp_affinity

This doesn't help.

I tried echo 0 > /proc/irq/68/smp_affinity_list. That doesn't help either.

> 
> If it doesnt work, you might try instead :
> 
> echo 1 >/proc/irq/default_smp_affinity
> <you might need to reload bnx2 module, or ifdown/ifup eth0 >

This helps; the bug is not reproduced in this case.

# cat /proc/interrupts  | grep eth0
  68:      60777          0          0          0          0          0
0          0   PCI-MSI-edge      eth0

Thanks.
Eric Dumazet - Dec. 29, 2012, 9:12 p.m.
On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:

> Is it right, that a received window will be less, if packets are not sorted?
> Looks like a bug.

Not really a bug.

TCP is very sensitive to packet reordering. I won't elaborate here as
it's a bit off topic.

Try reordering the credits/debits on your bank account; I am pretty sure
you'll lose some money or even get into serious trouble.

Of course, enabling GRO on eth0 would definitely help a bit...

(once/iff veth driver features are fixed to allow GSO packets being
forwarded without being segmented again)
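For reference, checking and toggling GRO is done with ethtool (eth0 is the assumed interface name; requires root):

```shell
# Show the current GRO state, then enable it on the physical NIC.
ethtool -k eth0 | grep generic-receive-offload
ethtool -K eth0 gro on
```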



Andrew Vagin - Dec. 29, 2012, 9:15 p.m.
On Sat, Dec 29, 2012 at 07:58:36PM +0100, Eric Dumazet wrote:
> On Saturday, Dec. 29, 2012 at 09:40 -0800, Eric Dumazet wrote:
> 
> > 
> > Please post your new tcpdump then ;)
> > 
> > also post "netstat -s" from root and test ns after your wgets
> 
> Also try following bnx2 patch.
> 
> It should help GRO / TCP coalesce
> 
> bnx2 should be the last driver not using skb head_frag
> 

This patch breaks nothing, but I don't know what kind of benefit I should
get from it :).

FYI:
I forgot to say that I disabled GRO before collecting the tcpdump, because
that way the tcpdumps from veth and from eth0 are easier to compare.

Andrew Vagin - Dec. 29, 2012, 9:19 p.m.
On Sat, Dec 29, 2012 at 01:12:26PM -0800, Eric Dumazet wrote:
> On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> 
> > Is it right, that a received window will be less, if packets are not sorted?
> > Looks like a bug.
> 
> Not really a bug.
> 
> TCP is very sensitive to packet reorders. I wont elaborate here as
> its a bit off topic.
> 
> Try to reorders credits/debits on your bank account, I am pretty sure
> you'll lose some money or even get serious troubles.
> 
> Of course, enabling GRO on eth0 would definitely help a bit...
> 
> (once/iff veth driver features are fixed to allow GSO packets being
> forwarded without being segmented again)
> 

Eric, thank you for the help.
I need some time to think. I will ask you if new questions come up.

> 
> 

Patch

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index a1adfaf..08a2d40 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -2726,6 +2726,14 @@  bnx2_free_rx_page(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index)
 	rx_pg->page = NULL;
 }
 
+static void bnx2_frag_free(const struct bnx2 *bp, void *data)
+{
+	if (bp->rx_frag_size)
+		put_page(virt_to_head_page(data));
+	else
+		kfree(data);
+}
+
 static inline int
 bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
 {
@@ -2735,7 +2743,10 @@  bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf
 	struct bnx2_rx_bd *rxbd =
 		&rxr->rx_desc_ring[BNX2_RX_RING(index)][BNX2_RX_IDX(index)];
 
-	data = kmalloc(bp->rx_buf_size, gfp);
+	if (bp->rx_frag_size)
+		data = netdev_alloc_frag(bp->rx_frag_size);
+	else
+		data = kmalloc(bp->rx_buf_size, gfp);
 	if (!data)
 		return -ENOMEM;
 
@@ -2744,7 +2755,7 @@  bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf
 				 bp->rx_buf_use_size,
 				 PCI_DMA_FROMDEVICE);
 	if (dma_mapping_error(&bp->pdev->dev, mapping)) {
-		kfree(data);
+		bnx2_frag_free(bp, data);
 		return -EIO;
 	}
 
@@ -3014,9 +3025,9 @@  error:
 
 	dma_unmap_single(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size,
 			 PCI_DMA_FROMDEVICE);
-	skb = build_skb(data, 0);
+	skb = build_skb(data, bp->rx_frag_size);
 	if (!skb) {
-		kfree(data);
+		bnx2_frag_free(bp, data);
 		goto error;
 	}
 	skb_reserve(skb, ((u8 *)get_l2_fhdr(data) - data) + BNX2_RX_OFFSET);
@@ -5358,6 +5369,10 @@  bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
 	/* hw alignment + build_skb() overhead*/
 	bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
 		NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	if (bp->rx_buf_size <= PAGE_SIZE)
+		bp->rx_frag_size = bp->rx_buf_size;
+	else
+		bp->rx_frag_size = 0;
 	bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET;
 	bp->rx_ring_size = size;
 	bp->rx_max_ring = bnx2_find_max_ring(size, BNX2_MAX_RX_RINGS);
@@ -5436,7 +5451,7 @@  bnx2_free_rx_skbs(struct bnx2 *bp)
 
 			rx_buf->data = NULL;
 
-			kfree(data);
+			bnx2_frag_free(bp, data);
 		}
 		for (j = 0; j < bp->rx_max_pg_ring_idx; j++)
 			bnx2_free_rx_page(bp, rxr, j);
diff --git a/drivers/net/ethernet/broadcom/bnx2.h b/drivers/net/ethernet/broadcom/bnx2.h
index 172efbe..11f5dee 100644
--- a/drivers/net/ethernet/broadcom/bnx2.h
+++ b/drivers/net/ethernet/broadcom/bnx2.h
@@ -6804,6 +6804,7 @@  struct bnx2 {
 
 	u32			rx_buf_use_size;	/* useable size */
 	u32			rx_buf_size;		/* with alignment */
+	u32			rx_frag_size; /* 0 if kmalloced(), or rx_buf_size */
 	u32			rx_copy_thresh;
 	u32			rx_jumbo_thresh;
 	u32			rx_max_ring_idx;