[net-next,2/5] ixgbe: increase default TX ring buffer to 1024

Message ID 20140514141748.20309.83121.stgit@dragon
State RFC
Delegated to: David Miller
Headers show

Commit Message

Jesper Dangaard Brouer May 14, 2014, 2:17 p.m.
Using pktgen I'm seeing the ixgbe driver "push-back", due TX ring
running full.  Thus, the TX ring is artificially limiting pktgen.

Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
counters.

Increasing the TX ring buffer should be done carefully, as it comes at
a higher memory cost, which can also negatively influence performance.
E.g. ring buffer array of struct ixgbe_tx_buffer (current size 48bytes)
increase from 512*48=24576bytes to 1024*48=49152bytes which is larger
than the L1 data cache (32KB on my E5-2630), thus increasing the L1->L2
cache-references.

Adjusting the TX ring buffer (TXSZ) measured over 10 sec with ifpps
 (single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * cmd: ethtool -G eth8 tx $TXSZ
 * 3,930,065 pps -- TXSZ= 512
 * 5,312,249 pps -- TXSZ= 768
 * 5,362,722 pps -- TXSZ=1024
 * 5,361,390 pps -- TXSZ=1536
 * 5,362,439 pps -- TXSZ=2048
 * 5,359,744 pps -- TXSZ=4096

Choosing size 1024 because for the next optimizations 768 is not
enough.

Notice after commit 6f25cd47d (pktgen: fix xmit test for BQL enabled
devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag.  This allow us to put
more pressure on the TX ring buffers.

It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
pktgen respecting this in the call to netif_xmit_frozen_or_drv_stopped(txq).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Laight May 14, 2014, 2:28 p.m. | #1
From: Jesper Dangaard Brouer

> Using pktgen I'm seeing the ixgbe driver "push-back", due TX ring

> running full.  Thus, the TX ring is artificially limiting pktgen.

> 

> Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"

> counters.


Have you tried reducing the tx interrupt mitigation delay.
It might just be that the end of tx interrupt is delayed too long.

	David
Alexander Duyck May 14, 2014, 4:28 p.m. | #2
On 05/14/2014 07:17 AM, Jesper Dangaard Brouer wrote:
> Using pktgen I'm seeing the ixgbe driver "push-back", due TX ring
> running full.  Thus, the TX ring is artificially limiting pktgen.
> 
> Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
> counters.
> 
> Increasing the TX ring buffer should be done carefully, as it comes at
> a higher memory cost, which can also negatively influence performance.
> E.g. ring buffer array of struct ixgbe_tx_buffer (current size 48bytes)
> increase from 512*48=24576bytes to 1024*48=49152bytes which is larger
> than the L1 data cache (32KB on my E5-2630), thus increasing the L1->L2
> cache-references.
> 
> Adjusting the TX ring buffer (TXSZ) measured over 10 sec with ifpps
>  (single CPU performance, ixgbe 10Gbit/s, E5-2630)
>  * cmd: ethtool -G eth8 tx $TXSZ
>  * 3,930,065 pps -- TXSZ= 512
>  * 5,312,249 pps -- TXSZ= 768
>  * 5,362,722 pps -- TXSZ=1024
>  * 5,361,390 pps -- TXSZ=1536
>  * 5,362,439 pps -- TXSZ=2048
>  * 5,359,744 pps -- TXSZ=4096
> 
> Choosing size 1024 because for the next optimizations 768 is not
> enough.
> 
> Notice after commit 6f25cd47d (pktgen: fix xmit test for BQL enabled
> devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
> the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag.  This allow us to put
> more pressure on the TX ring buffers.
> 
> It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
> pktgen respecting this in the call to netif_xmit_frozen_or_drv_stopped(txq).
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
> 
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index c688c8a..bf078fe 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -63,7 +63,7 @@
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>  
>  /* TX/RX descriptor defines */
> -#define IXGBE_DEFAULT_TXD		    512
> +#define IXGBE_DEFAULT_TXD		   1024
>  #define IXGBE_DEFAULT_TX_WORK		    256
>  #define IXGBE_MAX_TXD			   4096
>  #define IXGBE_MIN_TXD			     64
> 

What is the point of optimizing ixgbe for a synthetic benchmark?  In my
experience the full stack can only handle about 2Mpps at 60B packets
with a single queue.  Updating the defaults for a pktgen test seems
unrealistic as that isn't really a standard use case for the driver.

I'd say that it might be better to just add a note to the documentation
folder indicating what configuration is optimal for pktgen rather then
changing everyone's defaults to support one specific test.

Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller May 14, 2014, 5:49 p.m. | #3
From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Wed, 14 May 2014 09:28:50 -0700

> I'd say that it might be better to just add a note to the documentation
> folder indicating what configuration is optimal for pktgen rather then
> changing everyone's defaults to support one specific test.

We could have drivers provide a pktgen config adjustment mechanism,
so if someone starts pktgen then the device auto-adjusts to a pktgen
optimal configuration (whatever that may entail).
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesper Dangaard Brouer May 14, 2014, 7:09 p.m. | #4
On Wed, 14 May 2014 13:49:50 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Wed, 14 May 2014 09:28:50 -0700
> 
> > I'd say that it might be better to just add a note to the documentation
> > folder indicating what configuration is optimal for pktgen rather then
> > changing everyone's defaults to support one specific test.
> 
> We could have drivers provide a pktgen config adjustment mechanism,
> so if someone starts pktgen then the device auto-adjusts to a pktgen
> optimal configuration (whatever that may entail).

That might be problematic because changing the TX queue size cause the
ixgbe driver to reset the link.

Notice that pktgen is ignoring BQL.  I'm sort of hoping that BQL will
push back for real use-cases, to avoid the bad effects of increasing
the TX size.

One of the bad effects, I'm hoping BQL will mitigate, is the case of
filling the TX queue with large frames.  Consider 9K jumbo frames, how
long time will it take to empty 1024 jumbo frames on a 10G link:

(9000*8)/(10000*10^6)*1000*1024 = 7.37ms

But with 9K MTU and 512, we already have:
 (9000*8)/(10000*10^6)*1000*512 = 3.69ms

Guess the more normal use-case would be 1500+38 (Ethernet overhead)
 (1538*8)/(10000*10^6)*1000*1024 = 1.25ms

And then again, these calculations should then in theory be multiplied
with the number of TX queues.


I know, increasing these limits should not be taken lightly, but we
just have to be crystal clear that the current 512 limit, is
artificially limiting the capabilities of your hardware.

We can postpone this increase, because I also observe a 2Mpps limit
when actually using (alloc/free) real SKBs.  The alloc/free cost is
currently just hiding this limitation.
Jesper Dangaard Brouer May 14, 2014, 7:25 p.m. | #5
On Wed, 14 May 2014 14:28:24 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Jesper Dangaard Brouer
> > Using pktgen I'm seeing the ixgbe driver "push-back", due TX ring
> > running full.  Thus, the TX ring is artificially limiting pktgen.
> > 
> > Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
> > counters.
> 
> Have you tried reducing the tx interrupt mitigation delay.
> It might just be that the end of tx interrupt is delayed too long.

Does the ixgbe have TX interrupt mitigation delays?

I don't "see" them:

$ sudo ethtool -c eth8 
Coalesce parameters for eth8:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 1
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
David Miller May 14, 2014, 7:54 p.m. | #6
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 14 May 2014 21:09:35 +0200

> On Wed, 14 May 2014 13:49:50 -0400 (EDT)
> David Miller <davem@davemloft.net> wrote:
> 
>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>> Date: Wed, 14 May 2014 09:28:50 -0700
>> 
>> > I'd say that it might be better to just add a note to the documentation
>> > folder indicating what configuration is optimal for pktgen rather then
>> > changing everyone's defaults to support one specific test.
>> 
>> We could have drivers provide a pktgen config adjustment mechanism,
>> so if someone starts pktgen then the device auto-adjusts to a pktgen
>> optimal configuration (whatever that may entail).
> 
> That might be problematic because changing the TX queue size cause the
> ixgbe driver to reset the link.

A minor issue for someone firing up a specialized network test tool
like pktgen.

> Notice that pktgen is ignoring BQL.  I'm sort of hoping that BQL will
> push back for real use-cases, to avoid the bad effects of increasing
> the TX size.

I think we need to think carefully about drivers configuring such huge
per-queue TX queue sizes.

If anything, we should be decreasing their size in the default
configuration.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight May 15, 2014, 9:16 a.m. | #7
From: Jesper Dangaard Brouer
> On Wed, 14 May 2014 13:49:50 -0400 (EDT)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Alexander Duyck <alexander.h.duyck@intel.com>
> > Date: Wed, 14 May 2014 09:28:50 -0700
> >
> > > I'd say that it might be better to just add a note to the documentation
> > > folder indicating what configuration is optimal for pktgen rather then
> > > changing everyone's defaults to support one specific test.
> >
> > We could have drivers provide a pktgen config adjustment mechanism,
> > so if someone starts pktgen then the device auto-adjusts to a pktgen
> > optimal configuration (whatever that may entail).
> 
> That might be problematic because changing the TX queue size cause the
> ixgbe driver to reset the link.
> 
> Notice that pktgen is ignoring BQL.  I'm sort of hoping that BQL will
> push back for real use-cases, to avoid the bad effects of increasing
> the TX size.
> 
> One of the bad effects, I'm hoping BQL will mitigate, is the case of
> filling the TX queue with large frames.  Consider 9K jumbo frames, how
> long time will it take to empty 1024 jumbo frames on a 10G link:
> 
> (9000*8)/(10000*10^6)*1000*1024 = 7.37ms

Never mind 9k 'jumbo' frames, I'm pretty sure ixgbe supports TCP segmentation
offload - so you can have 64k frames in the tx ring.
Since each takes (about) 44 ethernet frames that is about 55ms
to clear the queue.

	David



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesper Dangaard Brouer May 29, 2014, 3:29 p.m. | #8
On Wed, 14 May 2014 21:09:35 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> > From: Alexander Duyck <alexander.h.duyck@intel.com>
> > Date: Wed, 14 May 2014 09:28:50 -0700
> > 
> > > I'd say that it might be better to just add a note to the documentation
> > > folder indicating what configuration is optimal for pktgen rather then
> > > changing everyone's defaults to support one specific test.
>
> I know, increasing these limits should not be taken lightly, but we
> just have to be crystal clear that the current 512 limit, is
> artificially limiting the capabilities of your hardware.

The above statement is mine and it is wrong ;-)

 I'm dropping this patch because of the following understanding:

Alexander Duyck pointed out to me, that interrupt throttling might be
the reason behind the need to increase the TX ring size.  I tested this
and Alex is right.

The needed TX ring size ("ethtool -g") for max performance, is
directly corrolated with how fast/often the TX cleanup is running.

Adjusting the "ethtool -C <ethx> rx-usecs" value affect how often we
cleanup the ring(s).  The default value "1" is some auto interrupt throttling.

Notice with these coalesce tuning, the performance even increase from
6.7Mpps to 7.1Mpps on top of patchset V1.

On top of V1 patchset:
 - 6,747,016 pps - rx-usecs:  1 tx-ring: 1024 (irqs: 9492)
 - 6,684,612 pps - rx-usecs: 10 tx-ring: 1024 (irqs:99322)
 - 7,005,226 pps - rx-usecs: 20 tx-ring: 1024 (irqs:50444)
 - 7,113,048 pps - rx-usecs: 30 tx-ring: 1024 (irqs:34004)
 - 7,133,019 pps - rx-usecs: 40 tx-ring: 1024 (irqs:25845)
 - 7,168,399 pps - rx-usecs: 50 tx-ring: 1024 (irqs:20896)

Look same performance with 512 TX ring.

Lowering TX ring size to (default) 512:
 (On top of V1 patchset)
 - 3,934,674 pps - rx-usecs:  1 tx-ring: 512 (irqs: 9602)
 - 6,684,066 pps - rx-usecs: 10 tx-ring: 512 (irqs:99370)
 - 7,001,235 pps - rx-usecs: 20 tx-ring: 512 (irqs:50567)
 - 7,115,047 pps - rx-usecs: 30 tx-ring: 512 (irqs:34105)
 - 7,130,250 pps - rx-usecs: 40 tx-ring: 512 (irqs:25741)
 - 7,165,296 pps - rx-usecs: 50 tx-ring: 512 (irqs:20898)

Look how even a 256 TX ring is enough, if we cleanup the TX ring fast
enough, and how performance decrease if we cleanup to slowly.

Lowering TX ring size to (default) 256:
 (On top of V1 patchset)
 - 1,883,360 pps - rx-usecs:  1 tx-ring: 256 (irqs: 9800)
 - 6,683,552 pps - rx-usecs: 10 tx-ring: 256 (irqs:99786)
 - 7,005,004 pps - rx-usecs: 20 tx-ring: 256 (irqs:50749)
 - 7,108,776 pps - rx-usecs: 30 tx-ring: 256 (irqs:34536)
 - 5,734,301 pps - rx-usecs: 40 tx-ring: 256 (irqs:25909)
 - 4,590,093 pps - rx-usecs: 50 tx-ring: 256 (irqs:21183)

Patch

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index c688c8a..bf078fe 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -63,7 +63,7 @@ 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 /* TX/RX descriptor defines */
-#define IXGBE_DEFAULT_TXD		    512
+#define IXGBE_DEFAULT_TXD		   1024
 #define IXGBE_DEFAULT_TX_WORK		    256
 #define IXGBE_MAX_TXD			   4096
 #define IXGBE_MIN_TXD			     64