Message ID | 20141202071304.GA22512@1wt.eu |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
On Tue, 2014-12-02 at 08:13 +0100, Willy Tarreau wrote:
> The mvneta driver sets the amount of Tx coalesce packets to 16 by
> default. Normally that does not cause any trouble since the driver
> uses a much larger Tx ring size (532 packets). But some sockets
> might run with very small buffers, much smaller than the equivalent
> of 16 packets. This is what ping is doing for example, by setting
> SNDBUF to 324 bytes, rounded up to 2kB by the kernel.
>
> The problem is that there is no documented method to force a specific
> packet to emit an interrupt (eg: the last of the ring) nor is it
> possible to make the NIC emit an interrupt after a given delay.
>
> In this case, it causes trouble, because when ping sends packets over
> its raw socket, the first few packets leave the system, and the first
> 15 packets will be emitted without an IRQ being generated, so without
> the skbs being freed. And since the socket's buffer is small, there's
> no way to reach that amount of packets, and the ping ends up with
> "send: no buffer available" after sending 6 packets. Running with 3
> instances of ping in parallel is enough to hide the problem, because
> with 6 packets per instance, that's 18 packets total, which is enough
> to grant a Tx interrupt before all are sent.
>
> The original driver in the LSP kernel worked around this design flaw
> by using a software timer to clean up the Tx descriptors. This timer
> was slow and caused terrible network performance on some Tx-bound
> workloads (such as routing) but was enough to make tools like ping
> work correctly.
>
> Instead here, we simply set the packet count before interrupt to 1.
> This ensures that each packet sent will produce an interrupt. NAPI
> takes care of coalescing interrupts since the interrupt is disabled
> once generated.
>
> No measurable performance impact nor CPU usage were observed on small
> nor large packets, including when saturating the link on Tx, and this
> fixes tools like ping which rely on too small a send buffer. If one
> wants to increase this value for certain workloads where it is safe
> to do so, "ethtool -C $dev tx-frames" will override this default
> setting.
>
> This fix needs to be applied to stable kernels starting with 3.10.
>
> Tested-By: Maggie Mae Roxas <maggie.mae.roxas@gmail.com>
> Signed-off-by: Willy Tarreau <w@1wt.eu>
>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 4762994..35bfba7 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -214,7 +214,7 @@
>  /* Various constants */
>
>  /* Coalescing */
> -#define MVNETA_TXDONE_COAL_PKTS		16
> +#define MVNETA_TXDONE_COAL_PKTS		1
>  #define MVNETA_RX_COAL_PKTS		32
>  #define MVNETA_RX_COAL_USEC		100

I am surprised TCP even worked correctly with this problem.

I highly suggest BQL for this driver, now this issue is fixed.

I wonder if this high setting was not because of race conditions in the
driver : mvneta_tx() seems to access skb->len too late, TX completion
might have already freed skb :

	u64_stats_update_begin(&stats->syncp);
	stats->tx_packets++;
	stats->tx_bytes += skb->len; // potential use after free
	u64_stats_update_end(&stats->syncp);

Thanks Willy !

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Eric,

On Tue, Dec 02, 2014 at 04:18:08AM -0800, Eric Dumazet wrote:
> > diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> > index 4762994..35bfba7 100644
> > --- a/drivers/net/ethernet/marvell/mvneta.c
> > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > @@ -214,7 +214,7 @@
> >  /* Various constants */
> >
> >  /* Coalescing */
> > -#define MVNETA_TXDONE_COAL_PKTS		16
> > +#define MVNETA_TXDONE_COAL_PKTS		1
> >  #define MVNETA_RX_COAL_PKTS		32
> >  #define MVNETA_RX_COAL_USEC		100
>
> I am surprised TCP even worked correctly with this problem.

There are multiple explanations to this :
  - we used to flush Tx queues upon Rx interrupt, which used to hide
    the problem ;
  - we tend to have large socket buffers, which cover the issue. I've
    never had any issue at high data rates. After all, only a 23kB send
    buffer is needed to get it to work ;
  - most often you have multiple parallel streams, which hide the issue
    even more.

> I highly suggest BQL for this driver, now this issue is fixed.

How does that work ?

> I wonder if this high setting was not because of race conditions in the
> driver :
>
> mvneta_tx() seems to access skb->len too late, TX completion might have
> already freed skb :
>
> 	u64_stats_update_begin(&stats->syncp);
> 	stats->tx_packets++;
> 	stats->tx_bytes += skb->len; // potential use after free
> 	u64_stats_update_end(&stats->syncp);

Good catch! But no, this is unrelated, since it does not fix the race
either. Initially this driver used to implement a timer to flush the Tx
queue after 10ms, resulting in abysmal Tx-only performance as you can
easily imagine. In my opinion there's a design flaw in the chip, and
they did everything they could to work around it (let's not say "hide
it"), but that was not enough. When I "fixed" the performance issue by
enabling the Tx interrupt, I kept the value 16 which gave pretty good
results to me, without realizing that there was this corner case :-/

> Thanks Willy !
Thanks for your review :-)

Willy
On Tue, 2014-12-02 at 13:39 +0100, Willy Tarreau wrote:
> Hi Eric,
>
> On Tue, Dec 02, 2014 at 04:18:08AM -0800, Eric Dumazet wrote:
> > I highly suggest BQL for this driver, now this issue is fixed.
>
> How does that work ?

This works very well ;)

It's super easy to implement, take a look at commits
bdbc063129e811264cd6c311d8c2d9b95de01231 or
7070ce0a6419a118842298bc967061ad6cea40db
On Tue, Dec 02, 2014 at 05:04:18AM -0800, Eric Dumazet wrote:
> On Tue, 2014-12-02 at 13:39 +0100, Willy Tarreau wrote:
> > On Tue, Dec 02, 2014 at 04:18:08AM -0800, Eric Dumazet wrote:
> > > I highly suggest BQL for this driver, now this issue is fixed.
> >
> > How does that work ?
>
> This works very well ;)
>
> Its super easy to implement, take a look at commits
> bdbc063129e811264cd6c311d8c2d9b95de01231 or
> 7070ce0a6419a118842298bc967061ad6cea40db

Thanks but I'm not sure I entirely understand the concept. Is it to
notify the sender that the packets were already queued for the NIC ?
And if so, how does that improve the situation ? I'm sorry if this
sounds like a stupid question, it's just that the concept by itself
is not clear to me.

Willy
On Tue, 2014-12-02 at 14:24 +0100, Willy Tarreau wrote:
> Thanks but I'm not sure I entirely understand the concept. Is it to
> notify the sender that the packets were already queued for the NIC ?
> And if so, how does that improve the situation ? I'm sorry if this
> sounds like a stupid question, it's just that the concept by itself
> is not clear to me.

http://lwn.net/Articles/454390/

BQL is the first step in fighting so-called bufferbloat.

https://www.ietf.org/proceedings/86/slides/slides-86-iccrg-0.pdf
Eric,

On 12/02/2014 09:18 AM, Eric Dumazet wrote:
[..]
> I am surprised TCP even worked correctly with this problem.
>
> I highly suggest BQL for this driver, now this issue is fixed.

Implementing BQL for the mvneta driver was something I wanted to do a
while ago, but you explained that this kind of driver (i.e. those with
software TSO) didn't need BQL, because the latency that resulted from
the ring was too small.

Quoting (http://www.spinics.net/lists/netdev/msg284439.html):
""
Note that a full size TSO packet (44 or 45 MSS) requires about 88 or 90
descriptors.

So I do not think BQL is really needed, because a 512 slots TX ring
won't add a big latency : about 5 ms max.

BQL is generally nice for NICs supporting hardware TSO, where a 64KB
TSO packet consumes 3 or 4 descriptors.

Also note that TCP Small Queues should limit TX ring occupancy of a
single bulk flow anyway.
""

Maybe I misunderstood something?
On Tue, Dec 2, 2014 at 9:12 AM, Ezequiel Garcia
<ezequiel.garcia@free-electrons.com> wrote:
> Implementing BQL for the mvneta driver was something I wanted to do a
> while ago, but you explained that this kind of driver (i.e. those
> with software TSO) didn't need BQL, because the latency that resulted
> from the ring was too small.
>
> Quoting (http://www.spinics.net/lists/netdev/msg284439.html):
> ""
> Note that a full size TSO packet (44 or 45 MSS) requires about 88 or
> 90 descriptors.
>
> So I do not think BQL is really needed, because a 512 slots TX ring
> wont add a big latency : About 5 ms max.

... at a gigabit, and wildly variable with pause frames. And: 50ms at
100mbit, 500ms at 10mbit.

Secondly, why accept any latency in the driver when you can get it well
below 2ms in most circumstances with a couple of lines of BQL code?

> BQL is generally nice for NIC supporting hardware TSO, where a 64KB
> TSO packet consumes 3 or 4 descriptors.

BQL is nice at any speed, and darn useful at lower speeds as well.

> Also note that TCP Small Queues should limit TX ring occupancy of a
> single bulk flow anyway.

We only do one upload or one download at a time on our systems, followed
by a ping measurement... never do more than one flow at the same time,
right?

> ""
>
> Maybe I misunderstood something?

Try it, test it, you'll like it.
On Tue, 2014-12-02 at 14:12 -0300, Ezequiel Garcia wrote:
[...]
> Maybe I misunderstood something?

This was indeed the case, but we recently added xmit_more support, and
it uses BQL information.

So you might add BQL anyway, if xmit_more support is useful for this
hardware.
From: Willy Tarreau <w@1wt.eu>
Date: Tue, 2 Dec 2014 08:13:04 +0100

> The mvneta driver sets the amount of Tx coalesce packets to 16 by
> default. Normally that does not cause any trouble since the driver
> uses a much larger Tx ring size (532 packets). But some sockets
> might run with very small buffers, much smaller than the equivalent
> of 16 packets. This is what ping is doing for example, by setting
> SNDBUF to 324 bytes rounded up to 2kB by the kernel.
>
> The problem is that there is no documented method to force a specific
> packet to emit an interrupt (eg: the last of the ring) nor is it
> possible to make the NIC emit an interrupt after a given delay.
>
> In this case, it causes trouble, because when ping sends packets over
> its raw socket, the few first packets leave the system, and the first
> 15 packets will be emitted without an IRQ being generated, so without
> the skbs being freed. And since the socket's buffer is small, there's
> no way to reach that amount of packets, and the ping ends up with
> "send: no buffer available" after sending 6 packets. Running with 3
> instances of ping in parallel is enough to hide the problem, because
> with 6 packets per instance, that's 18 packets total, which is enough
> to grant a Tx interrupt before all are sent.
>
> The original driver in the LSP kernel worked around this design flaw
> by using a software timer to clean up the Tx descriptors. This timer
> was slow and caused terrible network performance on some Tx-bound
> workloads (such as routing) but was enough to make tools like ping
> work correctly.
>
> Instead here, we simply set the packet counts before interrupt to 1.
> This ensures that each packet sent will produce an interrupt. NAPI
> takes care of coalescing interrupts since the interrupt is disabled
> once generated.
>
> No measurable performance impact nor CPU usage were observed on small
> nor large packets, including when saturating the link on Tx, and this
> fixes tools like ping which rely on too small a send buffer. If one
> wants to increase this value for certain workloads where it is safe
> to do so, "ethtool -C $dev tx-frames" will override this default
> setting.
>
> This fix needs to be applied to stable kernels starting with 3.10.
>
> Tested-By: Maggie Mae Roxas <maggie.mae.roxas@gmail.com>
> Signed-off-by: Willy Tarreau <w@1wt.eu>

Applied and queued up for -stable, thanks.
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 4762994..35bfba7 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -214,7 +214,7 @@
 /* Various constants */
 
 /* Coalescing */
-#define MVNETA_TXDONE_COAL_PKTS		16
+#define MVNETA_TXDONE_COAL_PKTS		1
 #define MVNETA_RX_COAL_PKTS		32
 #define MVNETA_RX_COAL_USEC		100