net: mvneta: fix Tx interrupt delay

Message ID 20141202071304.GA22512@1wt.eu
State Accepted, archived
Delegated to: David Miller

Commit Message

Willy Tarreau Dec. 2, 2014, 7:13 a.m. UTC
The mvneta driver sets the amount of Tx coalesce packets to 16 by
default. Normally that does not cause any trouble since the driver
uses a much larger Tx ring size (532 packets). But some sockets
might run with very small buffers, much smaller than the equivalent
of 16 packets. This is what ping is doing for example, by setting
SNDBUF to 324 bytes rounded up to 2kB by the kernel.

The problem is that there is no documented method to force a specific
packet to emit an interrupt (eg: the last of the ring) nor is it
possible to make the NIC emit an interrupt after a given delay.

In this case, it causes trouble, because when ping sends packets over
its raw socket, the first few packets leave the system, and the first
15 packets will be emitted without an IRQ being generated, so without
the skbs being freed. And since the socket's buffer is small, there's
no way to reach that number of packets, and ping ends up with
"send: no buffer available" after sending 6 packets. Running 3
instances of ping in parallel is enough to hide the problem, because
with 6 packets per instance, that's 18 packets total, which is enough
to trigger a Tx interrupt before all are sent.

The original driver in the LSP kernel worked around this design flaw
by using a software timer to clean up the Tx descriptors. This timer
was slow and caused terrible network performance on some Tx-bound
workloads (such as routing) but was enough to make tools like ping
work correctly.

Instead here, we simply set the packet counts before interrupt to 1.
This ensures that each packet sent will produce an interrupt. NAPI
takes care of coalescing interrupts since the interrupt is disabled
once generated.

No measurable performance impact or extra CPU usage was observed with
either small or large packets, including when saturating the link on Tx,
and this fixes tools like ping which rely on too small a send buffer. If one
wants to increase this value for certain workloads where it is safe
to do so, "ethtool -C $dev tx-frames" will override this default
setting.

This fix needs to be applied to stable kernels starting with 3.10.

Tested-By: Maggie Mae Roxas <maggie.mae.roxas@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

---
 drivers/net/ethernet/marvell/mvneta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
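
For reference, the override mentioned above uses the standard ethtool
coalescing interface: "ethtool -c eth0" shows the current coalescing
parameters, and "ethtool -C eth0 tx-frames 16" would restore the previous
default ("eth0" is only an example interface name).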

Comments

Eric Dumazet Dec. 2, 2014, 12:18 p.m. UTC | #1
On Tue, 2014-12-02 at 08:13 +0100, Willy Tarreau wrote:
> The mvneta driver sets the amount of Tx coalesce packets to 16 by
> default. Normally that does not cause any trouble since the driver
> uses a much larger Tx ring size (532 packets). But some sockets
> might run with very small buffers, much smaller than the equivalent
> of 16 packets. This is what ping is doing for example, by setting
> SNDBUF to 324 bytes rounded up to 2kB by the kernel.
> 
> The problem is that there is no documented method to force a specific
> packet to emit an interrupt (eg: the last of the ring) nor is it
> possible to make the NIC emit an interrupt after a given delay.
> 
> In this case, it causes trouble, because when ping sends packets over
> its raw socket, the first few packets leave the system, and the first
> 15 packets will be emitted without an IRQ being generated, so without
> the skbs being freed. And since the socket's buffer is small, there's
> no way to reach that number of packets, and ping ends up with
> "send: no buffer available" after sending 6 packets. Running 3
> instances of ping in parallel is enough to hide the problem, because
> with 6 packets per instance, that's 18 packets total, which is enough
> to trigger a Tx interrupt before all are sent.
> 
> The original driver in the LSP kernel worked around this design flaw
> by using a software timer to clean up the Tx descriptors. This timer
> was slow and caused terrible network performance on some Tx-bound
> workloads (such as routing) but was enough to make tools like ping
> work correctly.
> 
> Instead here, we simply set the packet counts before interrupt to 1.
> This ensures that each packet sent will produce an interrupt. NAPI
> takes care of coalescing interrupts since the interrupt is disabled
> once generated.
> 
> No measurable performance impact or extra CPU usage was observed with
> either small or large packets, including when saturating the link on Tx,
> and this fixes tools like ping which rely on too small a send buffer. If one
> wants to increase this value for certain workloads where it is safe
> to do so, "ethtool -C $dev tx-frames" will override this default
> setting.
> 
> This fix needs to be applied to stable kernels starting with 3.10.
> 
> Tested-By: Maggie Mae Roxas <maggie.mae.roxas@gmail.com>
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> 
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 4762994..35bfba7 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -214,7 +214,7 @@
>  /* Various constants */
>  
>  /* Coalescing */
> -#define MVNETA_TXDONE_COAL_PKTS		16
> +#define MVNETA_TXDONE_COAL_PKTS		1
>  #define MVNETA_RX_COAL_PKTS		32
>  #define MVNETA_RX_COAL_USEC		100
>  


I am surprised TCP even worked correctly with this problem.

I highly suggest BQL for this driver, now this issue is fixed.

I wonder if this high setting was not there because of race conditions
in the driver:

mvneta_tx() seems to access skb->len too late; TX completion might have
already freed the skb:

                u64_stats_update_begin(&stats->syncp);
                stats->tx_packets++;
                stats->tx_bytes  += skb->len;         // potential use after free
                u64_stats_update_end(&stats->syncp);
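
One way to close such a race (just a sketch reusing the fields from the
snippet above, not necessarily the exact fix that should go in) is to
snapshot the length before the descriptors are handed over to the
hardware, so the stats update never dereferences the skb afterwards:

                int len = skb->len;     /* snapshot while the skb is still ours */

                /* ... fill the Tx descriptors and hand them to the NIC ... */

                u64_stats_update_begin(&stats->syncp);
                stats->tx_packets++;
                stats->tx_bytes  += len;        /* no skb dereference here */
                u64_stats_update_end(&stats->syncp);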

Thanks Willy !

Willy Tarreau Dec. 2, 2014, 12:39 p.m. UTC | #2
Hi Eric,

On Tue, Dec 02, 2014 at 04:18:08AM -0800, Eric Dumazet wrote:
> > diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> > index 4762994..35bfba7 100644
> > --- a/drivers/net/ethernet/marvell/mvneta.c
> > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > @@ -214,7 +214,7 @@
> >  /* Various constants */
> >  
> >  /* Coalescing */
> > -#define MVNETA_TXDONE_COAL_PKTS		16
> > +#define MVNETA_TXDONE_COAL_PKTS		1
> >  #define MVNETA_RX_COAL_PKTS		32
> >  #define MVNETA_RX_COAL_USEC		100
> >  
> 
> 
> I am surprised TCP even worked correctly with this problem.

There are multiple explanations for this:
  - we used to flush Tx queues upon Rx interrupt, which used to hide
    the problem.

  - we tend to have large socket buffers, which mask the issue. I've
    never had any issue at high data rates. After all, only a 23kB send
    buffer is needed to get it to work.

  - most often you have multiple parallel streams which hide the issue
    even more.

> I highly suggest BQL for this driver, now this issue is fixed.

How does that work?

> I wonder if this high setting was not there because of race conditions
> in the driver:
>
> mvneta_tx() seems to access skb->len too late; TX completion might have
> already freed the skb:
> 
>                 u64_stats_update_begin(&stats->syncp);
>                 stats->tx_packets++;
>                 stats->tx_bytes  += skb->len;         // potential use after free
>                 u64_stats_update_end(&stats->syncp);

Good catch! But no, this is unrelated, since this change does not fix
that race either. Initially this driver used to implement a timer to
flush the Tx queue after 10ms, resulting in abysmal Tx-only performance
as you can easily imagine. In my opinion there's a design flaw in the
chip, and they did everything they could to work around it (let's not
say "hide it"), but that was not enough. When I "fixed" the performance
issue by enabling the Tx interrupt, I kept the value 16 which gave me
pretty good results, without realizing that there was this corner
case :-/

> Thanks Willy !

Thanks for your review :-)

Willy

Eric Dumazet Dec. 2, 2014, 1:04 p.m. UTC | #3
On Tue, 2014-12-02 at 13:39 +0100, Willy Tarreau wrote:
> Hi Eric,
> 
> On Tue, Dec 02, 2014 at 04:18:08AM -0800, Eric Dumazet wrote:
> 
> > I highly suggest BQL for this driver, now this issue is fixed.
> 
> > How does that work?

This works very well ;)

It's super easy to implement; take a look at commits
bdbc063129e811264cd6c311d8c2d9b95de01231 or
7070ce0a6419a118842298bc967061ad6cea40db
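
Roughly, the pattern looks like this (an illustrative sketch, not the
actual mvneta patch; txq_id, pkts and bytes stand for whatever the driver
already tracks per Tx queue):

	/* xmit path, once the skb has been queued on the Tx ring */
	struct netdev_queue *nq = netdev_get_tx_queue(dev, txq_id);

	netdev_tx_sent_queue(nq, skb->len);

	/* Tx completion path, after reclaiming 'pkts' skbs totalling 'bytes' */
	netdev_tx_completed_queue(nq, pkts, bytes);

	/* queue flush/reset path (e.g. when the interface goes down) */
	netdev_tx_reset_queue(nq);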


Willy Tarreau Dec. 2, 2014, 1:24 p.m. UTC | #4
On Tue, Dec 02, 2014 at 05:04:18AM -0800, Eric Dumazet wrote:
> On Tue, 2014-12-02 at 13:39 +0100, Willy Tarreau wrote:
> > Hi Eric,
> > 
> > On Tue, Dec 02, 2014 at 04:18:08AM -0800, Eric Dumazet wrote:
> > 
> > > I highly suggest BQL for this driver, now this issue is fixed.
> > 
> > > How does that work?
> 
> This works very well ;)
> 
> It's super easy to implement; take a look at commits
> bdbc063129e811264cd6c311d8c2d9b95de01231 or
> 7070ce0a6419a118842298bc967061ad6cea40db

Thanks, but I'm not sure I entirely understand the concept. Is it to
notify the sender that the packets were already queued for the NIC?
And if so, how does that improve the situation? I'm sorry if this
sounds like a stupid question, it's just that the concept by itself
is not clear to me.

Willy

Eric Dumazet Dec. 2, 2014, 3:17 p.m. UTC | #5
On Tue, 2014-12-02 at 14:24 +0100, Willy Tarreau wrote:

> Thanks, but I'm not sure I entirely understand the concept. Is it to
> notify the sender that the packets were already queued for the NIC?
> And if so, how does that improve the situation? I'm sorry if this
> sounds like a stupid question, it's just that the concept by itself
> is not clear to me.

http://lwn.net/Articles/454390/

BQL is a first step to fight so-called bufferbloat.

https://www.ietf.org/proceedings/86/slides/slides-86-iccrg-0.pdf


Ezequiel Garcia Dec. 2, 2014, 5:12 p.m. UTC | #6
Eric,

On 12/02/2014 09:18 AM, Eric Dumazet wrote:
[..]
> 
> I am surprised TCP even worked correctly with this problem.
> 
> I highly suggest BQL for this driver, now this issue is fixed.
> 

Implementing BQL for the mvneta driver was something I wanted to do a
while ago, but you explained that this kind of driver (i.e. one with
software TSO) didn't need BQL, because the latency resulting from the
ring was too small to matter.

Quoting (http://www.spinics.net/lists/netdev/msg284439.html):
""
Note that a full size TSO packet (44 or 45 MSS) requires about 88 or 90
descriptors.

So I do not think BQL is really needed, because a 512-slot TX ring won't
add a big latency: about 5 ms max.

BQL is generally nice for NICs supporting hardware TSO, where a 64KB TSO
packet consumes 3 or 4 descriptors.

Also note that TCP Small Queues should limit TX ring occupancy of a
single bulk flow anyway.
""

Maybe I misunderstood something?
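
(For context on the numbers quoted above: a maximal 64KB TSO packet at a
~1448-byte MSS is 65536 / 1448, i.e. roughly 44-45 segments, and with
software TSO each segment typically needs a header descriptor plus a data
descriptor, hence the 88-90 descriptors.)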
Dave Taht Dec. 2, 2014, 5:31 p.m. UTC | #7
On Tue, Dec 2, 2014 at 9:12 AM, Ezequiel Garcia
<ezequiel.garcia@free-electrons.com> wrote:
> Eric,
>
> On 12/02/2014 09:18 AM, Eric Dumazet wrote:
> [..]
>>
>> I am surprised TCP even worked correctly with this problem.
>>
>> I highly suggest BQL for this driver, now this issue is fixed.
>>
>
> Implementing BQL for the mvneta driver was something I wanted to do a
> while ago, but you explained that this kind of driver (i.e. one with
> software TSO) didn't need BQL, because the latency resulting from the
> ring was too small to matter.
>
> Quoting (http://www.spinics.net/lists/netdev/msg284439.html):
> ""
> Note that a full size TSO packet (44 or 45 MSS) requires about 88 or 90
> descriptors.
>
> So I do not think BQL is really needed, because a 512-slot TX ring won't
> add a big latency: about 5 ms max.

at a gigabit.

Wildly variable with pause frames. And: 50ms at 100mbit. 500ms at 10mbit.
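
(Back of the envelope: 512 descriptors x ~1500 bytes is roughly 750 kB of
queued data, which drains in a few milliseconds at 1 Gbit/s but takes ten
times longer at 100 Mbit/s and a hundred times longer at 10 Mbit/s.)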

Secondly, why accept any latency in the driver when you can get it well below
2ms in most circumstances with a couple lines of BQL code?

>
> BQL is generally nice for NICs supporting hardware TSO, where a 64KB TSO
> packet consumes 3 or 4 descriptors.

BQL is nice at any speed, and darn useful at lower speeds,
as well.

>
> Also note that TCP Small Queues should limit TX ring occupancy of a
> single bulk flow anyway.

We only do one upload or one download at a time on our systems,
followed by a ping measurement... never do more than one flow
at the same time, right?


> ""
>
> Maybe I misunderstood something?

Try it, test it, you'll like it.

> --
> Ezequiel García, Free Electrons
> Embedded Linux, Kernel and Android Engineering
> http://free-electrons.com
Eric Dumazet Dec. 2, 2014, 6:39 p.m. UTC | #8
On Tue, 2014-12-02 at 14:12 -0300, Ezequiel Garcia wrote:
> Eric,
> 
> On 12/02/2014 09:18 AM, Eric Dumazet wrote:
> [..]
> > 
> > I am surprised TCP even worked correctly with this problem.
> > 
> > I highly suggest BQL for this driver, now this issue is fixed.
> > 
> 
> Implementing BQL for the mvneta driver was something I wanted to do a
> while ago, but you explained that this kind of driver (i.e. one with
> software TSO) didn't need BQL, because the latency resulting from the
> ring was too small to matter.
> 
> Quoting (http://www.spinics.net/lists/netdev/msg284439.html):
> ""
> Note that a full size TSO packet (44 or 45 MSS) requires about 88 or 90
> descriptors.
> 
> So I do not think BQL is really needed, because a 512-slot TX ring won't
> add a big latency: about 5 ms max.
>
> BQL is generally nice for NICs supporting hardware TSO, where a 64KB TSO
> packet consumes 3 or 4 descriptors.
> 
> Also note that TCP Small Queues should limit TX ring occupancy of a
> single bulk flow anyway.
> ""
> 
> Maybe I misunderstood something?

This was indeed the case, but we recently added xmit_more support, and
it uses BQL information. So you might add BQL anyway, if xmit_more
support is useful for this hardware.
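
The usual shape on the driver side is to defer the tail-pointer/doorbell
write while the stack says more packets are coming, e.g. (sketch only,
3.18-era API; hw_kick_tx_queue() and its arguments are placeholders for
whatever the driver uses to trigger Tx DMA):

	/* at the end of ndo_start_xmit() */
	if (!skb->xmit_more || netif_xmit_stopped(nq))
		hw_kick_tx_queue(pp, txq);	/* placeholder Tx trigger */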



David Miller Dec. 9, 2014, 1:44 a.m. UTC | #9
From: Willy Tarreau <w@1wt.eu>
Date: Tue, 2 Dec 2014 08:13:04 +0100

> The mvneta driver sets the amount of Tx coalesce packets to 16 by
> default. Normally that does not cause any trouble since the driver
> uses a much larger Tx ring size (532 packets). But some sockets
> might run with very small buffers, much smaller than the equivalent
> of 16 packets. This is what ping is doing for example, by setting
> SNDBUF to 324 bytes rounded up to 2kB by the kernel.
> 
> The problem is that there is no documented method to force a specific
> packet to emit an interrupt (eg: the last of the ring) nor is it
> possible to make the NIC emit an interrupt after a given delay.
> 
> In this case, it causes trouble, because when ping sends packets over
> its raw socket, the first few packets leave the system, and the first
> 15 packets will be emitted without an IRQ being generated, so without
> the skbs being freed. And since the socket's buffer is small, there's
> no way to reach that number of packets, and ping ends up with
> "send: no buffer available" after sending 6 packets. Running 3
> instances of ping in parallel is enough to hide the problem, because
> with 6 packets per instance, that's 18 packets total, which is enough
> to trigger a Tx interrupt before all are sent.
> 
> The original driver in the LSP kernel worked around this design flaw
> by using a software timer to clean up the Tx descriptors. This timer
> was slow and caused terrible network performance on some Tx-bound
> workloads (such as routing) but was enough to make tools like ping
> work correctly.
> 
> Instead here, we simply set the packet counts before interrupt to 1.
> This ensures that each packet sent will produce an interrupt. NAPI
> takes care of coalescing interrupts since the interrupt is disabled
> once generated.
> 
> No measurable performance impact or extra CPU usage was observed with
> either small or large packets, including when saturating the link on Tx,
> and this fixes tools like ping which rely on too small a send buffer. If one
> wants to increase this value for certain workloads where it is safe
> to do so, "ethtool -C $dev tx-frames" will override this default
> setting.
> 
> This fix needs to be applied to stable kernels starting with 3.10.
> 
> Tested-By: Maggie Mae Roxas <maggie.mae.roxas@gmail.com>
> Signed-off-by: Willy Tarreau <w@1wt.eu>

Applied and queued up for -stable, thanks.

Patch

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 4762994..35bfba7 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -214,7 +214,7 @@ 
 /* Various constants */
 
 /* Coalescing */
-#define MVNETA_TXDONE_COAL_PKTS		16
+#define MVNETA_TXDONE_COAL_PKTS		1
 #define MVNETA_RX_COAL_PKTS		32
 #define MVNETA_RX_COAL_USEC		100