
twice past the taps, thence out to net?

Message ID 1324009676.2562.9.camel@edumazet-laptop
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Dec. 16, 2011, 4:27 a.m. UTC
On Thursday, December 15, 2011 at 14:22 -0800, Rick Jones wrote:
> On 12/15/2011 11:00 AM, Eric Dumazet wrote:
> >> Devices work better if the driver proactively manages stop_queue/wake_queue.
> >> Old devices used TX_BUSY, but newer devices tend to manage the queue
> >> themselves.
> >>
> >
> > Some 'new' drivers like igb can be fooled if the skb is GSO-segmented?
> >
> > Because igb_xmit_frame_ring() needs skb_shinfo(skb)->nr_frags + 4
> > descriptors, igb should stop its queue not at MAX_SKB_FRAGS + 4, but
> > MAX_SKB_FRAGS*4
> >
> > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> > index 89d576c..989da36 100644
> > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > @@ -4370,7 +4370,7 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
> >   	igb_tx_map(tx_ring, first, hdr_len);
> >
> >   	/* Make sure there is space in the ring for the next send. */
> > -	igb_maybe_stop_tx(tx_ring, MAX_SKB_FRAGS + 4);
> > +	igb_maybe_stop_tx(tx_ring, MAX_SKB_FRAGS * 4);
> >
> >   	return NETDEV_TX_OK;
> 
> 
> Is there a minimum transmit queue length here?  I get the impression 
> that MAX_SKB_FRAGS is at least 16 and is 18 on a system with 4096-byte 
> pages.  The previous addition would then be OK so long as the TX queue 
> was always at least 22 entries in size, but now it would always have to 
> be at least 72?
> 
> I guess things are "OK" at the moment:
> 
> raj@tardy:~/net-next/drivers/net/ethernet/intel/igb$ grep IGB_MIN_TXD *.[ch]
> igb_ethtool.c:	new_tx_count = max_t(u16, new_tx_count, IGB_MIN_TXD);
> igb.h:#define IGB_MIN_TXD                       80
> 
> but is that getting a little close?
> 
> rick jones

Sure!

I only pointed out a possible problem and did not give a full patch, since
we also need to change the opposite threshold (the point at which we XON the
queue at TX completion).

You can see it's not even consistent with the minimum for a single TSO
frame! Most probably your high requeue numbers come from this value being
too low given the real requirements of the hardware (4 + nr_frags
descriptors per skb):

/* How many Tx Descriptors do we need to call netif_wake_queue ? */ 
#define IGB_TX_QUEUE_WAKE   16
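
To make the mismatch concrete, here is a small self-contained sketch of the
arithmetic (illustrative constants only, not igb code; MAX_SKB_FRAGS is
assumed to be 18, the usual value with 4K pages):

/* Illustrative arithmetic only -- not driver code.  igb consumes about
 * 4 + nr_frags descriptors per skb, so a single worst-case skb already
 * needs more descriptors than the current wake threshold of 16.
 */
#include <stdio.h>

#define EXAMPLE_MAX_SKB_FRAGS   18      /* assumed 4K-page value */
#define OLD_TX_QUEUE_WAKE       16      /* current igb wake threshold */

int main(void)
{
        int worst_case_skb = EXAMPLE_MAX_SKB_FRAGS + 4;  /* 22 descriptors */

        printf("worst-case skb needs %d descriptors, queue woken at %d free\n",
               worst_case_skb, OLD_TX_QUEUE_WAKE);
        /* So the qdisc can be woken while the ring still cannot hold one
         * maximal skb; the driver refuses it and the packet is requeued,
         * which would explain the high requeue counts.
         */
        return 0;
}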


Maybe we should CC the Intel guys.

Could you try the following patch?

Thanks!




Comments

Jesse Brandeburg Dec. 16, 2011, 6:28 p.m. UTC | #1
On Thu, Dec 15, 2011 at 8:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thursday, December 15, 2011 at 14:22 -0800, Rick Jones wrote:
>> On 12/15/2011 11:00 AM, Eric Dumazet wrote:
>> >> Devices work better if the driver proactively manages stop_queue/wake_queue.
>> >> Old devices used TX_BUSY, but newer devices tend to manage the queue
>> >> themselves.
>> >>
>> >
>> > Some 'new' drivers like igb can be fooled if the skb is GSO-segmented?
>> >
>> > Because igb_xmit_frame_ring() needs skb_shinfo(skb)->nr_frags + 4
>> > descriptors, igb should stop its queue not at MAX_SKB_FRAGS + 4, but
>> > MAX_SKB_FRAGS*4

Can you please help me understand the need for MAX_SKB_FRAGS * 4 as
the requirement?  Currently the driver uses logic like this:

in hard_start_xmit: I just finished a TX; stop the qdisc if I don't
have room (in TX descriptors) for a worst-case transmit skb
(MAX_SKB_FRAGS + 4) the next time I'm called.
when cleaning from interrupt: my cleanup is done; do I have enough
free TX descriptors (again MAX_SKB_FRAGS + 4) for a worst-case
transmit?  If yes, restart the qdisc.
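
A tiny userspace model of that two-sided protocol (made-up names such as
model_ring and WORST_CASE_DESCS; a sketch of the pattern, not the actual
igb functions):

/* Model of the stop/wake protocol described above -- purely illustrative.
 * 'stopped' stands in for the netif/qdisc queue state, 'unused' for the
 * number of free TX descriptors in the ring.
 */
#include <stdbool.h>
#include <stdio.h>

#define MODEL_MAX_SKB_FRAGS     18                         /* assumed 4K-page value */
#define WORST_CASE_DESCS        (MODEL_MAX_SKB_FRAGS + 4)  /* one maximal skb */

struct model_ring {
        int  unused;    /* free descriptors */
        bool stopped;   /* queue stopped: the qdisc will not dequeue */
};

/* hard_start_xmit side: after consuming 'used' descriptors for an skb,
 * stop the queue if another worst-case skb would no longer fit. */
static void model_xmit(struct model_ring *r, int used)
{
        r->unused -= used;
        if (r->unused < WORST_CASE_DESCS)
                r->stopped = true;
}

/* TX-clean (interrupt) side: descriptors reclaimed; wake the queue only
 * once a full worst-case skb fits again. */
static void model_clean(struct model_ring *r, int reclaimed)
{
        r->unused += reclaimed;
        if (r->stopped && r->unused >= WORST_CASE_DESCS)
                r->stopped = false;
}

int main(void)
{
        struct model_ring r = { .unused = 256, .stopped = false };

        model_xmit(&r, WORST_CASE_DESCS);       /* one maximal TSO skb */
        model_clean(&r, WORST_CASE_DESCS);
        printf("unused=%d stopped=%d\n", r.unused, r.stopped);
        return 0;
}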

I'm missing the jump from the above logic to using MAX_SKB_FRAGS * 4
(== (18 * 4) == 72) as the minimum number of descriptors I need for a
worst case TSO.  Each descriptor can point to up to 16kB of contiguous
memory, typically we use 1 for offload context setup, 1 for skb->data,
and 1 for each page.  I think we may be overestimating with
MAX_SKB_FRAGS + 4, but that should be no big deal.
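
Worked out with an assumed MAX_SKB_FRAGS of 18 (the 4K-page value), that
accounting looks like the following; this is illustrative code, not the
driver's:

/* Descriptor accounting sketch: one offload-context descriptor, one for
 * skb->data, one per page fragment.  MAX_SKB_FRAGS + 4 therefore already
 * includes a couple of descriptors of slack.
 */
#include <stdio.h>

int main(void)
{
        int context_desc = 1;
        int head_desc    = 1;
        int frag_descs   = 18; /* assumed MAX_SKB_FRAGS on a 4K-page system */

        printf("worst case = %d descriptors, reserved = %d\n",
               context_desc + head_desc + frag_descs,   /* 20 */
               frag_descs + 4);                         /* MAX_SKB_FRAGS + 4 = 22 */
        return 0;
}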
Eric Dumazet Dec. 16, 2011, 7:34 p.m. UTC | #2
On Friday, December 16, 2011 at 10:28 -0800, Jesse Brandeburg wrote:
> On Thu, Dec 15, 2011 at 8:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Thursday, December 15, 2011 at 14:22 -0800, Rick Jones wrote:
> >> On 12/15/2011 11:00 AM, Eric Dumazet wrote:
> >> >> Devices work better if the driver proactively manages stop_queue/wake_queue.
> >> >> Old devices used TX_BUSY, but newer devices tend to manage the queue
> >> >> themselves.
> >> >>
> >> >
> >> > Some 'new' drivers like igb can be fooled if the skb is GSO-segmented?
> >> >
> >> > Because igb_xmit_frame_ring() needs skb_shinfo(skb)->nr_frags + 4
> >> > descriptors, igb should stop its queue not at MAX_SKB_FRAGS + 4, but
> >> > MAX_SKB_FRAGS*4
> 
> Can you please help me understand the need for MAX_SKB_FRAGS * 4 as
> the requirement?  Currently the driver uses logic like this:
> 
> in hard_start_xmit: I just finished a TX; stop the qdisc if I don't
> have room (in TX descriptors) for a worst-case transmit skb
> (MAX_SKB_FRAGS + 4) the next time I'm called.
> when cleaning from interrupt: my cleanup is done; do I have enough
> free TX descriptors (again MAX_SKB_FRAGS + 4) for a worst-case
> transmit?  If yes, restart the qdisc.
> 
> I'm missing the jump from the above logic to using MAX_SKB_FRAGS * 4
> (== (18 * 4) == 72) as the minimum number of descriptors I need for a
> worst case TSO.  Each descriptor can point to up to 16kB of contiguous
> memory, typically we use 1 for offload context setup, 1 for skb->data,
> and 1 for each page.  I think we may be overestimating with
> MAX_SKB_FRAGS + 4, but that should be no big deal.

Did you read my second patch?

The problem is that you wake up the queue too soon (at 16 available
descriptors, while a full TSO packet needs more than that).

How would you explain the high 'requeues' numbers if this were not the
problem?

Also, it's suboptimal to wake up the queue when the available space is very
low, since only _one_ packet may be dequeued from the qdisc (you pay a high
cost in cache line bouncing).

My first patch was about a very rare event: a full TSO packet is segmented
in gso_segment() [say, if you dynamically disable SG on the eth device and
an old TCP buffer is retransmitted]. You end up with 16 skbs delivered to
the NIC; in this case we can hit the TX ring limit at the 4th or 5th skb,
and Rick complains that tcpdump outputs some packets several times ;)

Since igb needs 4 descriptors for a linear skb, I said 4 * MAX_SKB_FRAGS,
but the real problem is addressed in my second patch, I believe.
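
To put assumed numbers on the 4 * MAX_SKB_FRAGS figure (illustrative
arithmetic only, not driver code):

/* Rough arithmetic for the rare software-GSO case described above:
 * gso_segment() turns one maximal TSO skb into many linear skbs, and
 * each linear skb costs about 4 descriptors in igb, so the burst needs
 * far more ring space than a single skb would.
 */
#include <stdio.h>

int main(void)
{
        int segments_per_tso     = 16;  /* order of magnitude from the mail */
        int descs_per_linear_skb = 4;   /* igb: 4 descriptors per linear skb */

        printf("one segmented TSO burst needs about %d descriptors\n",
               segments_per_tso * descs_per_linear_skb);  /* ~64, i.e. ~4 * MAX_SKB_FRAGS */
        return 0;
}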



Rick Jones Dec. 16, 2011, 7:35 p.m. UTC | #3
>> but is that getting a little close?
>>
>> rick jones
>
> Sure!
>
> I only pointed out a possible problem and did not give a full patch, since
> we also need to change the opposite threshold (the point at which we XON the
> queue at TX completion).
>
> You can see it's not even consistent with the minimum for a single TSO
> frame! Most probably your high requeue numbers come from this value being
> too low given the real requirements of the hardware (4 + nr_frags
> descriptors per skb):
>
> /* How many Tx Descriptors do we need to call netif_wake_queue ? */
> #define IGB_TX_QUEUE_WAKE   16
>
>
> Maybe we should CC the Intel guys.
>
> Could you try the following patch?

I would *love* to.  All my accessible igb-driven hardware is in an 
environment locked to the kernels already there :(  Not that it makes it 
more possible for me to do it, but I suspect it does not require 30 
receivers to reproduce the dups with netperf TCP_STREAM.  Particularly 
if the tx queue len is at 256 it may only take 6 or 8. In fact let me 
try that now...

Yep, with just 8 destinations/concurrent TCP_STREAM tests from the one 
system one can still see the duplicates in the packet trace taken on the 
sender.

Perhaps we can trouble the Intel guys to try to reproduce what I've seen?

rick

>
> Thanks !
>
> diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
> index c69feeb..93ce118 100644
> --- a/drivers/net/ethernet/intel/igb/igb.h
> +++ b/drivers/net/ethernet/intel/igb/igb.h
> @@ -51,8 +51,8 @@ struct igb_adapter;
>   /* TX/RX descriptor defines */
>   #define IGB_DEFAULT_TXD                  256
>   #define IGB_DEFAULT_TX_WORK		 128
> -#define IGB_MIN_TXD                       80
> -#define IGB_MAX_TXD                     4096
> +#define IGB_MIN_TXD		max_t(unsigned, 80U, IGB_TX_QUEUE_WAKE * 2)
> +#define IGB_MAX_TXD             4096
>
>   #define IGB_DEFAULT_RXD                  256
>   #define IGB_MIN_RXD                       80
> @@ -121,8 +121,11 @@ struct vf_data_storage {
>   #define IGB_RXBUFFER_16384 16384
>   #define IGB_RX_HDR_LEN     IGB_RXBUFFER_512
>
> -/* How many Tx Descriptors do we need to call netif_wake_queue ? */
> -#define IGB_TX_QUEUE_WAKE	16
> +/* How many Tx Descriptors should be available
> + * before calling netif_wake_subqueue() ?
> + */
> +#define IGB_TX_QUEUE_WAKE	(MAX_SKB_FRAGS * 4)
> +
>   /* How many Rx Buffers do we bundle into one write to the hardware ? */
>   #define IGB_RX_BUFFER_WRITE	16	/* Must be power of 2 */
>

Eric Dumazet Dec. 16, 2011, 7:44 p.m. UTC | #4
On Friday, December 16, 2011 at 11:35 -0800, Rick Jones wrote:

> I would *love* to.  All my accessible igb-driven hardware is in an 
> environment locked to the kernels already there :(  Not that it makes it 
> more possible for me to do it, but I suspect it does not require 30 
> receivers to reproduce the dups with netperf TCP_STREAM.  Particularly 
> if the tx queue len is at 256 it may only take 6 or 8. In fact let me 
> try that now...
> 
> Yep, with just 8 destinations/concurrent TCP_STREAM tests from the one 
> system one can still see the duplicates in the packet trace taken on the 
> sender.
> 
> Perhaps we can trouble the Intel guys to try to reproduce what I've seen?
> 

I do have an igb card somewhere (in fact, two dual ports); I'll do the
test myself!

Thanks


Wyborny, Carolyn Dec. 20, 2011, 9:21 p.m. UTC | #5
>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
>On Behalf Of Eric Dumazet
>Sent: Friday, December 16, 2011 11:45 AM
>To: Rick Jones
>Cc: Stephen Hemminger; Vijay Subramanian; tcpdump-workers@lists.tcpdump.org;
>netdev@vger.kernel.org; Vick, Matthew; Kirsher, Jeffrey T
>Subject: Re: twice past the taps, thence out to net?
>
>On Friday, December 16, 2011 at 11:35 -0800, Rick Jones wrote:
>
>> I would *love* to.  All my accessible igb-driven hardware is in an
>> environment locked to the kernels already there :(  Not that it makes it
>> more possible for me to do it, but I suspect it does not require 30
>> receivers to reproduce the dups with netperf TCP_STREAM.  Particularly
>> if the tx queue len is at 256 it may only take 6 or 8. In fact let me
>> try that now...
>>
>> Yep, with just 8 destinations/concurrent TCP_STREAM tests from the one
>> system one can still see the duplicates in the packet trace taken on the
>> sender.
>>
>> Perhaps we can trouble the Intel guys to try to reproduce what I've seen?
>>
>
>I do have an igb card somewhere (in fact, two dual ports); I'll do the
>test myself!
>
>Thanks

Let me know if I can do anything to assist.  Sorry to have overlooked this thread for a bit.

Thanks,

Carolyn

Carolyn Wyborny
Linux Development
LAN Access Division
Intel Corporation

Patch

diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index c69feeb..93ce118 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -51,8 +51,8 @@  struct igb_adapter;
 /* TX/RX descriptor defines */
 #define IGB_DEFAULT_TXD                  256
 #define IGB_DEFAULT_TX_WORK		 128
-#define IGB_MIN_TXD                       80
-#define IGB_MAX_TXD                     4096
+#define IGB_MIN_TXD		max_t(unsigned, 80U, IGB_TX_QUEUE_WAKE * 2)
+#define IGB_MAX_TXD             4096
 
 #define IGB_DEFAULT_RXD                  256
 #define IGB_MIN_RXD                       80
@@ -121,8 +121,11 @@  struct vf_data_storage {
 #define IGB_RXBUFFER_16384 16384
 #define IGB_RX_HDR_LEN     IGB_RXBUFFER_512
 
-/* How many Tx Descriptors do we need to call netif_wake_queue ? */
-#define IGB_TX_QUEUE_WAKE	16
+/* How many Tx Descriptors should be available
+ * before calling netif_wake_subqueue() ?
+ */
+#define IGB_TX_QUEUE_WAKE	(MAX_SKB_FRAGS * 4)
+
 /* How many Rx Buffers do we bundle into one write to the hardware ? */
 #define IGB_RX_BUFFER_WRITE	16	/* Must be power of 2 */