| Message ID | 1324009676.2562.9.camel@edumazet-laptop |
|---|---|
| State | RFC, archived |
| Delegated to | David Miller |
On Thu, Dec 15, 2011 at 8:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thursday, December 15, 2011 at 14:22 -0800, Rick Jones wrote:
>> On 12/15/2011 11:00 AM, Eric Dumazet wrote:
>> >> Devices work better if the driver proactively manages stop_queue/wake_queue.
>> >> Old devices used TX_BUSY, but newer devices tend to manage the queue
>> >> themselves.
>> >
>> > Some 'new' drivers like igb can be fooled in case an skb is gso segmented ?
>> >
>> > Because igb_xmit_frame_ring() needs skb_shinfo(skb)->nr_frags + 4
>> > descriptors, igb should stop its queue not at MAX_SKB_FRAGS + 4, but
>> > at MAX_SKB_FRAGS * 4

Can you please help me understand the need for MAX_SKB_FRAGS * 4 as the requirement? Currently the driver uses logic like:

in hard_start_tx: hey, I just finished a tx; I should stop the qdisc if I don't have room (in tx descriptors) for a worst-case transmit skb (MAX_SKB_FRAGS + 4) the next time I'm called.

when cleaning from interrupt: my cleanup is done; do I have enough free tx descriptors (should be MAX_SKB_FRAGS + 4) for a worst-case transmit? If yes, restart the qdisc.

I'm missing the jump from the above logic to using MAX_SKB_FRAGS * 4 (== 18 * 4 == 72) as the minimum number of descriptors I need for a worst-case TSO. Each descriptor can point to up to 16kB of contiguous memory; typically we use 1 for offload context setup, 1 for skb->data, and 1 for each page. I think we may be overestimating with MAX_SKB_FRAGS + 4, but that should be no big deal.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
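The stop/wake accounting Jesse describes can be sketched as a small standalone program. This is not the actual igb code: the ring structure and function names are illustrative, and MAX_SKB_FRAGS == 18 is taken from the "18 * 4 == 72" figure above.

```c
#include <assert.h>
#include <stdbool.h>

/* Values from the discussion: MAX_SKB_FRAGS was 18 at the time, and
 * igb_xmit_frame_ring() needs nr_frags + 4 descriptors per skb. */
#define MAX_SKB_FRAGS 18
#define DESC_NEEDED   (MAX_SKB_FRAGS + 4)   /* worst-case skb: 22 descriptors */

struct tx_ring {
    int count;      /* total descriptors in the ring */
    int used;       /* descriptors owned by not-yet-completed skbs */
    bool stopped;   /* models netif_stop_queue()/netif_wake_queue() state */
};

static int desc_avail(const struct tx_ring *r) { return r->count - r->used; }

/* hard_start_xmit path: after queuing an skb, stop the queue if a
 * worst-case skb would no longer fit the next time we are called. */
static void xmit_one(struct tx_ring *r, int nr_frags)
{
    r->used += nr_frags + 4;
    if (desc_avail(r) < DESC_NEEDED)
        r->stopped = true;
}

/* interrupt cleanup path: reclaim completed descriptors, and wake the
 * queue once a worst-case skb fits again. */
static void clean_tx_irq(struct tx_ring *r, int reclaimed)
{
    r->used -= reclaimed;
    if (r->stopped && desc_avail(r) >= DESC_NEEDED)
        r->stopped = false;
}
```

With a 64-descriptor ring, the first worst-case skb (22 descriptors) leaves 42 free and the queue stays up; a second drops availability to 20 < 22 and stops it, until cleanup reclaims enough to wake it.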
On Friday, December 16, 2011 at 10:28 -0800, Jesse Brandeburg wrote:
> On Thu, Dec 15, 2011 at 8:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Thursday, December 15, 2011 at 14:22 -0800, Rick Jones wrote:
> >> On 12/15/2011 11:00 AM, Eric Dumazet wrote:
> >> >> Devices work better if the driver proactively manages stop_queue/wake_queue.
> >> >> Old devices used TX_BUSY, but newer devices tend to manage the queue
> >> >> themselves.
> >> >
> >> > Some 'new' drivers like igb can be fooled in case an skb is gso segmented ?
> >> >
> >> > Because igb_xmit_frame_ring() needs skb_shinfo(skb)->nr_frags + 4
> >> > descriptors, igb should stop its queue not at MAX_SKB_FRAGS + 4, but
> >> > at MAX_SKB_FRAGS * 4
>
> Can you please help me understand the need for MAX_SKB_FRAGS * 4 as
> the requirement? Currently the driver uses logic like:
>
> in hard_start_tx: hey, I just finished a tx; I should stop the qdisc if
> I don't have room (in tx descriptors) for a worst-case transmit skb
> (MAX_SKB_FRAGS + 4) the next time I'm called.
> when cleaning from interrupt: my cleanup is done; do I have enough
> free tx descriptors (should be MAX_SKB_FRAGS + 4) for a worst-case
> transmit? If yes, restart the qdisc.
>
> I'm missing the jump from the above logic to using MAX_SKB_FRAGS * 4
> (== 18 * 4 == 72) as the minimum number of descriptors I need for a
> worst-case TSO. Each descriptor can point to up to 16kB of contiguous
> memory; typically we use 1 for offload context setup, 1 for skb->data,
> and 1 for each page. I think we may be overestimating with
> MAX_SKB_FRAGS + 4, but that should be no big deal.

Did you read my second patch? The problem is that you wake up the queue too soon (16 available descriptors, while a full TSO packet needs more than that).

How else would you explain the high 'requeues' number, if this was not the problem?
Also, it is suboptimal to wake up the queue when the available space is very low, since only _one_ packet may be dequeued from the qdisc (you pay a high cost in cache line bouncing).

My first patch was about a very rare event: a full TSO packet is segmented in gso_segment() [say, if you dynamically disable sg on an eth device and an old tcp buffer is retransmitted]. You end up with 16 skbs delivered to the NIC: in this case we can hit the tx ring limit at the 4th or 5th skb, and Rick complains tcpdump outputs some packets several times ;)

Since igb needs 4 descriptors for a linear skb, I said: 4 * MAX_SKB_FRAGS. But the real problem is addressed in my second patch, I believe?
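The arithmetic behind this scenario can be checked in isolation. A standalone sketch, not driver code; the constants mirror the numbers quoted in the thread (16 segments from one software-segmented TSO packet, 4 descriptors per linear skb, old wake threshold of 16):

```c
#include <assert.h>

/* One full TSO packet segmented in software (gso_segment()) yields
 * 16 linear skbs; igb needs 4 descriptors for each linear skb. */
#define SEGS_PER_TSO         16
#define DESC_PER_LINEAR_SKB   4
#define OLD_TX_QUEUE_WAKE    16                  /* old igb wake threshold */
#define MAX_SKB_FRAGS        18
#define NEW_TX_QUEUE_WAKE    (MAX_SKB_FRAGS * 4) /* 72, per the proposal */

/* Descriptors needed to push the whole segmented burst to the NIC. */
int descriptors_for_segmented_tso(void)
{
    return SEGS_PER_TSO * DESC_PER_LINEAR_SKB;   /* 64 */
}
```

So a queue woken with only 16 descriptors free cannot absorb the 64-descriptor burst and stops again almost immediately, which is consistent with the high requeue counts discussed above; the proposed 72-descriptor threshold covers the burst.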
>> but is that getting a little close?
>>
>> rick jones
>
> Sure !
>
> I only pointed out a possible problem, and did not give a full patch, since
> we also need to change the opposite threshold (when we XON the queue at
> TX completion)
>
> You can see it is not even consistent with the minimum for a single TSO
> frame ! Most probably your high requeue numbers come from this too-low
> value, given the real requirements of the hardware (4 + nr_frags
> descriptors per skb)
>
> /* How many Tx Descriptors do we need to call netif_wake_queue ? */
> #define IGB_TX_QUEUE_WAKE 16
>
> Maybe we should CC the Intel guys
>
> Could you try the following patch ?

I would *love* to. All my accessible igb-driven hardware is in an environment locked to the kernels already there :( Not that it makes it more possible for me to do it, but I suspect it does not require 30 receivers to reproduce the dups with netperf TCP_STREAM. Particularly if the tx queue len is at 256, it may only take 6 or 8. In fact, let me try that now...

Yep, with just 8 destinations/concurrent TCP_STREAM tests from the one system, one can still see the duplicates in the packet trace taken on the sender.

Perhaps we can trouble the Intel guys to try to reproduce what I've seen?

rick

> Thanks !
>
> diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
> index c69feeb..93ce118 100644
> --- a/drivers/net/ethernet/intel/igb/igb.h
> +++ b/drivers/net/ethernet/intel/igb/igb.h
> @@ -51,8 +51,8 @@ struct igb_adapter;
>  /* TX/RX descriptor defines */
>  #define IGB_DEFAULT_TXD      256
>  #define IGB_DEFAULT_TX_WORK  128
> -#define IGB_MIN_TXD          80
> -#define IGB_MAX_TXD          4096
> +#define IGB_MIN_TXD          max_t(unsigned, 80U, IGB_TX_QUEUE_WAKE * 2)
> +#define IGB_MAX_TXD          4096
>
>  #define IGB_DEFAULT_RXD      256
>  #define IGB_MIN_RXD          80
> @@ -121,8 +121,11 @@ struct vf_data_storage {
>  #define IGB_RXBUFFER_16384   16384
>  #define IGB_RX_HDR_LEN       IGB_RXBUFFER_512
>
> -/* How many Tx Descriptors do we need to call netif_wake_queue ? */
> -#define IGB_TX_QUEUE_WAKE 16
> +/* How many Tx Descriptors should be available
> + * before calling netif_wake_subqueue() ?
> + */
> +#define IGB_TX_QUEUE_WAKE (MAX_SKB_FRAGS * 4)
> +
>  /* How many Rx Buffers do we bundle into one write to the hardware ? */
>  #define IGB_RX_BUFFER_WRITE  16 /* Must be power of 2 */
On Friday, December 16, 2011 at 11:35 -0800, Rick Jones wrote:

> I would *love* to. All my accessible igb-driven hardware is in an
> environment locked to the kernels already there :( Not that it makes it
> more possible for me to do it, but I suspect it does not require 30
> receivers to reproduce the dups with netperf TCP_STREAM. Particularly
> if the tx queue len is at 256, it may only take 6 or 8. In fact, let me
> try that now...
>
> Yep, with just 8 destinations/concurrent TCP_STREAM tests from the one
> system, one can still see the duplicates in the packet trace taken on the
> sender.
>
> Perhaps we can trouble the Intel guys to try to reproduce what I've seen?

I do have an igb card somewhere (in fact two dual ports); I'll do the test myself!

Thanks
>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
>On Behalf Of Eric Dumazet
>Sent: Friday, December 16, 2011 11:45 AM
>To: Rick Jones
>Cc: Stephen Hemminger; Vijay Subramanian; tcpdump-workers@lists.tcpdump.org;
>netdev@vger.kernel.org; Vick, Matthew; Kirsher, Jeffrey T
>Subject: Re: twice past the taps, thence out to net?
>
>Le vendredi 16 décembre 2011 à 11:35 -0800, Rick Jones a écrit :
>
>> I would *love* to. All my accessible igb-driven hardware is in an
>> environment locked to the kernels already there :( Not that it makes it
>> more possible for me to do it, but I suspect it does not require 30
>> receivers to reproduce the dups with netperf TCP_STREAM. Particularly
>> if the tx queue len is at 256 it may only take 6 or 8. In fact let me
>> try that now...
>>
>> Yep, with just 8 destinations/concurrent TCP_STREAM tests from the one
>> system one can still see the duplicates in the packet trace taken on the
>> sender.
>>
>> Perhaps we can trouble the Intel guys to try to reproduce what I've seen?
>
>I do have an igb card somewhere (in fact two dual ports), I'll do the
>test myself !
>
>Thanks

Let me know if I can do anything to assist. Sorry to have overlooked this thread for a bit.

Thanks,

Carolyn

Carolyn Wyborny
Linux Development
LAN Access Division
Intel Corporation
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index c69feeb..93ce118 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -51,8 +51,8 @@ struct igb_adapter;
 /* TX/RX descriptor defines */
 #define IGB_DEFAULT_TXD      256
 #define IGB_DEFAULT_TX_WORK  128
-#define IGB_MIN_TXD          80
-#define IGB_MAX_TXD          4096
+#define IGB_MIN_TXD          max_t(unsigned, 80U, IGB_TX_QUEUE_WAKE * 2)
+#define IGB_MAX_TXD          4096

 #define IGB_DEFAULT_RXD      256
 #define IGB_MIN_RXD          80
@@ -121,8 +121,11 @@ struct vf_data_storage {
 #define IGB_RXBUFFER_16384   16384
 #define IGB_RX_HDR_LEN       IGB_RXBUFFER_512

-/* How many Tx Descriptors do we need to call netif_wake_queue ? */
-#define IGB_TX_QUEUE_WAKE 16
+/* How many Tx Descriptors should be available
+ * before calling netif_wake_subqueue() ?
+ */
+#define IGB_TX_QUEUE_WAKE (MAX_SKB_FRAGS * 4)
+
 /* How many Rx Buffers do we bundle into one write to the hardware ? */
 #define IGB_RX_BUFFER_WRITE  16 /* Must be power of 2 */
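The net effect of this patch on the two defines can be sanity-checked standalone. A sketch, not the kernel build: MAX_SKB_FRAGS == 18 is assumed from the "18 * 4 == 72" figure earlier in the thread, and max_t() is re-implemented here with the kernel's semantics (cast both arguments to the given type, take the larger).

```c
#include <assert.h>

/* Assumed from the thread; the real value comes from <linux/skbuff.h>. */
#define MAX_SKB_FRAGS      18

/* Local stand-in for the kernel's max_t() macro. */
#define max_t(type, a, b)  ((type)(a) > (type)(b) ? (type)(a) : (type)(b))

/* The two defines as they read after the patch. */
#define IGB_TX_QUEUE_WAKE  (MAX_SKB_FRAGS * 4)
#define IGB_MIN_TXD        max_t(unsigned, 80U, IGB_TX_QUEUE_WAKE * 2)

unsigned igb_min_txd(void)
{
    /* The floor on ring size doubles the wake threshold so the ring can
     * never be smaller than two full wake windows. */
    return IGB_MIN_TXD;
}
```

With these assumptions the wake threshold rises from 16 to 72 descriptors, and the minimum ring size rises from 80 to max(80, 144) = 144, which is why the patch ties IGB_MIN_TXD to IGB_TX_QUEUE_WAKE rather than leaving it a bare constant.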