Message ID | 1356807516.4102.4.camel@edumazet-laptop |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote: > Le samedi 29 décembre 2012 à 09:40 -0800, Eric Dumazet a écrit : > > > > > Please post your new tcpdump then ;) > > > > also post "netstat -s" from root and test ns after your wgets > > Also try following bnx2 patch. > > It should help GRO / TCP coalesce > > bnx2 should be the last driver not using skb head_frag And of course, you should make sure all your bnx2 interrupts are handled by the same cpu. Or else, packets might be reordered because the way dev_forward_skb() works. (CPU X gets a bunch of packets from eth0, forward them via netif_rx() in the local CPU X queue, NAPI is ended on eth0) CPU Y gets a bunch of packets from eth0, forward them via netif_rx() in the local CPU Y queue. CPU X and Y process their local queue in // -> packets are delivered Out of order to TCP stack Alternative is to setup RPS on your veth1 device, to force packets being delivered/handled by a given cpu -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote: > On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote: > > Le samedi 29 décembre 2012 à 09:40 -0800, Eric Dumazet a écrit : > > > > > > > > Please post your new tcpdump then ;) > > > > > > also post "netstat -s" from root and test ns after your wgets > > > > Also try following bnx2 patch. > > > > It should help GRO / TCP coalesce > > > > bnx2 should be the last driver not using skb head_frag I don't have access to the host. I'm going to test your patch tomorrow. Thanks. > > And of course, you should make sure all your bnx2 interrupts are handled > by the same cpu. All bnx interrupts are handled on all cpus. They are handled on the same cpu, if a kernel is booted with msi_disable=1. Is it right, that a received window will be less, if packets are not sorted? Looks like a bug. I want to say, that probably it works correctly, if packets are sorted. But I think if packets are not sorted, it should work with the same speed, cpu load and memory consumption may be a bit more. > > Or else, packets might be reordered because the way dev_forward_skb() > works. > > (CPU X gets a bunch of packets from eth0, forward them via netif_rx() in > the local CPU X queue, NAPI is ended on eth0) > > CPU Y gets a bunch of packets from eth0, forward them via netif_rx() in > the local CPU Y queue. > > CPU X and Y process their local queue in // -> packets are delivered Out > of order to TCP stack > > Alternative is to setup RPS on your veth1 device, to force packets being > delivered/handled by a given cpu > > > > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote: > On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote: > > On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote: > > > Le samedi 29 décembre 2012 à 09:40 -0800, Eric Dumazet a écrit : > > > > > > > > > > > Please post your new tcpdump then ;) > > > > > > > > also post "netstat -s" from root and test ns after your wgets > > > > > > Also try following bnx2 patch. > > > > > > It should help GRO / TCP coalesce > > > > > > bnx2 should be the last driver not using skb head_frag > > I don't have access to the host. I'm going to test your patch tomorrow. > Thanks. > > > > > And of course, you should make sure all your bnx2 interrupts are handled > > by the same cpu. > All bnx interrupts are handled on all cpus. They are handled on the same > cpu, if a kernel is booted with msi_disable=1. > > Is it right, that a received window will be less, if packets are not sorted? > Looks like a bug. > > I want to say, that probably it works correctly, if packets are sorted. > But I think if packets are not sorted, it should work with the same > speed, cpu load and memory consumption may be a bit more. Without veth, it doesnt really matter that IRQ are spread on multiple cpus, because packets are handled in NAPI, and only one cpu runs the eth0 NAPI handler at one time. But as soon as packets are queued (by netif_rx()) for 'later' processing, you can have dramatic performance decrease. Thats why you really should make sure IRQ on your eth0 device are handled by a single cpu. It will help to get better performance in most cases. echo 1 >/proc/irq/*/eth0/../smp_affinity If it doesnt work, you might try instead : echo 1 >/proc/irq/default_smp_affinity <you might need to reload bnx2 module, or ifdown/ifup eth0 > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Dec 29, 2012 at 12:20:07PM -0800, Eric Dumazet wrote: > On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote: > > On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote: > > > On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote: > > > > Le samedi 29 décembre 2012 à 09:40 -0800, Eric Dumazet a écrit : > > > > > > > > > > > > > > Please post your new tcpdump then ;) > > > > > > > > > > also post "netstat -s" from root and test ns after your wgets > > > > > > > > Also try following bnx2 patch. > > > > > > > > It should help GRO / TCP coalesce > > > > > > > > bnx2 should be the last driver not using skb head_frag > > > > I don't have access to the host. I'm going to test your patch tomorrow. > > Thanks. > > > > > > > > And of course, you should make sure all your bnx2 interrupts are handled > > > by the same cpu. > > All bnx interrupts are handled on all cpus. They are handled on the same > > cpu, if a kernel is booted with msi_disable=1. > > > > Is it right, that a received window will be less, if packets are not sorted? > > Looks like a bug. > > > > I want to say, that probably it works correctly, if packets are sorted. > > But I think if packets are not sorted, it should work with the same > > speed, cpu load and memory consumption may be a bit more. > > Without veth, it doesnt really matter that IRQ are spread on multiple > cpus, because packets are handled in NAPI, and only one cpu runs the > eth0 NAPI handler at one time. > > But as soon as packets are queued (by netif_rx()) for 'later' > processing, you can have dramatic performance decrease. > > Thats why you really should make sure IRQ on your eth0 device > are handled by a single cpu. > > It will help to get better performance in most cases. I understand this fact, but so big difference looks strange for me. Default configuration (with the bug): # cat /proc/interrupts | grep eth0 68: 10187 10188 10187 10023 10190 10185 10187 10019 PCI-MSI-edge eth0 > > echo 1 >/proc/irq/*/eth0/../smp_affinity This doesn't help. I tryed echo 0 > /proc/irq/68/smp_affinity_list. This doesn't help too. > > If it doesnt work, you might try instead : > > echo 1 >/proc/irq/default_smp_affinity > <you might need to reload bnx2 module, or ifdown/ifup eth0 > This helps, and the bug are not reproduced in this case. # cat /proc/interrupts | grep eth0 68: 60777 0 0 0 0 0 0 0 PCI-MSI-edge eth0 Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote: > Is it right, that a received window will be less, if packets are not sorted? > Looks like a bug. Not really a bug. TCP is very sensitive to packet reorders. I wont elaborate here as its a bit off topic. Try to reorders credits/debits on your bank account, I am pretty sure you'll lose some money or even get serious troubles. Of course, enabling GRO on eth0 would definitely help a bit... (once/iff veth driver features are fixed to allow GSO packets being forwarded without being segmented again) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Dec 29, 2012 at 07:58:36PM +0100, Eric Dumazet wrote: > Le samedi 29 décembre 2012 à 09:40 -0800, Eric Dumazet a écrit : > > > > > Please post your new tcpdump then ;) > > > > also post "netstat -s" from root and test ns after your wgets > > Also try following bnx2 patch. > > It should help GRO / TCP coalesce > > bnx2 should be the last driver not using skb head_frag > This patch breaks nothing. I don't know what kind of profit I should get with it:). FYI: I forgot to say, that I disable gro before collecting tcpdump, because in this case tcpdump from veth and from eth0 can be compared easier. > > diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c > index a1adfaf..08a2d40 100644 > --- a/drivers/net/ethernet/broadcom/bnx2.c > +++ b/drivers/net/ethernet/broadcom/bnx2.c > @@ -2726,6 +2726,14 @@ bnx2_free_rx_page(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index) > rx_pg->page = NULL; > } > > +static void bnx2_frag_free(const struct bnx2 *bp, void *data) > +{ > + if (bp->rx_frag_size) > + put_page(virt_to_head_page(data)); > + else > + kfree(data); > +} > + > static inline int > bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp) > { > @@ -2735,7 +2743,10 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf > struct bnx2_rx_bd *rxbd = > &rxr->rx_desc_ring[BNX2_RX_RING(index)][BNX2_RX_IDX(index)]; > > - data = kmalloc(bp->rx_buf_size, gfp); > + if (bp->rx_frag_size) > + data = netdev_alloc_frag(bp->rx_frag_size); > + else > + data = kmalloc(bp->rx_buf_size, gfp); > if (!data) > return -ENOMEM; > > @@ -2744,7 +2755,7 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf > bp->rx_buf_use_size, > PCI_DMA_FROMDEVICE); > if (dma_mapping_error(&bp->pdev->dev, mapping)) { > - kfree(data); > + bnx2_frag_free(bp, data); > return -EIO; > } > > @@ -3014,9 +3025,9 @@ error: > > dma_unmap_single(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size, > PCI_DMA_FROMDEVICE); > - skb = build_skb(data, 0); > + skb = build_skb(data, bp->rx_frag_size); > if (!skb) { > - kfree(data); > + bnx2_frag_free(bp, data); > goto error; > } > skb_reserve(skb, ((u8 *)get_l2_fhdr(data) - data) + BNX2_RX_OFFSET); > @@ -5358,6 +5369,10 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size) > /* hw alignment + build_skb() overhead*/ > bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) + > NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > + if (bp->rx_buf_size <= PAGE_SIZE) > + bp->rx_frag_size = bp->rx_buf_size; > + else > + bp->rx_frag_size = 0; > bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET; > bp->rx_ring_size = size; > bp->rx_max_ring = bnx2_find_max_ring(size, BNX2_MAX_RX_RINGS); > @@ -5436,7 +5451,7 @@ bnx2_free_rx_skbs(struct bnx2 *bp) > > rx_buf->data = NULL; > > - kfree(data); > + bnx2_frag_free(bp, data); > } > for (j = 0; j < bp->rx_max_pg_ring_idx; j++) > bnx2_free_rx_page(bp, rxr, j); > diff --git a/drivers/net/ethernet/broadcom/bnx2.h b/drivers/net/ethernet/broadcom/bnx2.h > index 172efbe..11f5dee 100644 > --- a/drivers/net/ethernet/broadcom/bnx2.h > +++ b/drivers/net/ethernet/broadcom/bnx2.h > @@ -6804,6 +6804,7 @@ struct bnx2 { > > u32 rx_buf_use_size; /* useable size */ > u32 rx_buf_size; /* with alignment */ > + u32 rx_frag_size; /* 0 if kmalloced(), or rx_buf_size */ > u32 rx_copy_thresh; > u32 rx_jumbo_thresh; > u32 rx_max_ring_idx; > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Dec 29, 2012 at 01:12:26PM -0800, Eric Dumazet wrote: > On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote: > > > Is it right, that a received window will be less, if packets are not sorted? > > Looks like a bug. > > Not really a bug. > > TCP is very sensitive to packet reorders. I wont elaborate here as > its a bit off topic. > > Try to reorders credits/debits on your bank account, I am pretty sure > you'll lose some money or even get serious troubles. > > Of course, enabling GRO on eth0 would definitely help a bit... > > (once/iff veth driver features are fixed to allow GSO packets being > forwarded without being segmented again) > Eric, thank you for the help. I need time for thinking. I will ask you, if new questions will appear. > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c index a1adfaf..08a2d40 100644 --- a/drivers/net/ethernet/broadcom/bnx2.c +++ b/drivers/net/ethernet/broadcom/bnx2.c @@ -2726,6 +2726,14 @@ bnx2_free_rx_page(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index) rx_pg->page = NULL; } +static void bnx2_frag_free(const struct bnx2 *bp, void *data) +{ + if (bp->rx_frag_size) + put_page(virt_to_head_page(data)); + else + kfree(data); +} + static inline int bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp) { @@ -2735,7 +2743,10 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf struct bnx2_rx_bd *rxbd = &rxr->rx_desc_ring[BNX2_RX_RING(index)][BNX2_RX_IDX(index)]; - data = kmalloc(bp->rx_buf_size, gfp); + if (bp->rx_frag_size) + data = netdev_alloc_frag(bp->rx_frag_size); + else + data = kmalloc(bp->rx_buf_size, gfp); if (!data) return -ENOMEM; @@ -2744,7 +2755,7 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf bp->rx_buf_use_size, PCI_DMA_FROMDEVICE); if (dma_mapping_error(&bp->pdev->dev, mapping)) { - kfree(data); + bnx2_frag_free(bp, data); return -EIO; } @@ -3014,9 +3025,9 @@ error: dma_unmap_single(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size, PCI_DMA_FROMDEVICE); - skb = build_skb(data, 0); + skb = build_skb(data, bp->rx_frag_size); if (!skb) { - kfree(data); + bnx2_frag_free(bp, data); goto error; } skb_reserve(skb, ((u8 *)get_l2_fhdr(data) - data) + BNX2_RX_OFFSET); @@ -5358,6 +5369,10 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size) /* hw alignment + build_skb() overhead*/ bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) + NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + if (bp->rx_buf_size <= PAGE_SIZE) + bp->rx_frag_size = bp->rx_buf_size; + else + bp->rx_frag_size = 0; bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET; bp->rx_ring_size = size; bp->rx_max_ring = bnx2_find_max_ring(size, BNX2_MAX_RX_RINGS); @@ -5436,7 +5451,7 @@ bnx2_free_rx_skbs(struct bnx2 *bp) rx_buf->data = NULL; - kfree(data); + bnx2_frag_free(bp, data); } for (j = 0; j < bp->rx_max_pg_ring_idx; j++) bnx2_free_rx_page(bp, rxr, j); diff --git a/drivers/net/ethernet/broadcom/bnx2.h b/drivers/net/ethernet/broadcom/bnx2.h index 172efbe..11f5dee 100644 --- a/drivers/net/ethernet/broadcom/bnx2.h +++ b/drivers/net/ethernet/broadcom/bnx2.h @@ -6804,6 +6804,7 @@ struct bnx2 { u32 rx_buf_use_size; /* useable size */ u32 rx_buf_size; /* with alignment */ + u32 rx_frag_size; /* 0 if kmalloced(), or rx_buf_size */ u32 rx_copy_thresh; u32 rx_jumbo_thresh; u32 rx_max_ring_idx;