Message ID | 20171006175727.868-15-jeffrey.t.kirsher@intel.com |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
Series | 40GbE Intel Wired LAN Driver Updates 2017-10-06 | expand |
From: Jeff Kirsher > Sent: 06 October 2017 18:57 > From: Jacob Keller <jacob.e.keller@intel.com> > > Since commit 6a7fded776a7 ("i40e: Fix RS bit update in Tx path and > disable force WB workaround") we've tried to "optimize" setting the > RS bit based around skb->xmit_more. This same logic was refactored > in commit 1dc8b538795f ("i40e: Reorder logic for coalescing RS bits"), > but ultimately was not functionally changed. I've tried to understand the above, but without definitions of WB and RS and the '4-ness' of something it is quite difficult. I THINK this is all about telling the hardware there is a packet in the ring to transmit? I don't understand the 4-ness. Linux requires that the hardware be notified of a single packet transmit, and that the 'transmit complete' also be notified in 'a timely manner' for a single packet. skb->xmit_more ought to be usable to optimise both of these (assuming it isn't a lie). The driver would need to ensure a ring full of messages isn't generated without either wakeup - but that might be 128 frames. FWIW on the systems I have PCIe writes are relatively cheap (reads are slow). So different counts would be appropriate for delaying doorbell writes and requesting completion interrupts. David
On Mon, Oct 9, 2017 at 2:07 AM, David Laight <David.Laight@aculab.com> wrote: > From: Jeff Kirsher >> Sent: 06 October 2017 18:57 >> From: Jacob Keller <jacob.e.keller@intel.com> >> >> Since commit 6a7fded776a7 ("i40e: Fix RS bit update in Tx path and >> disable force WB workaround") we've tried to "optimize" setting the >> RS bit based around skb->xmit_more. This same logic was refactored >> in commit 1dc8b538795f ("i40e: Reorder logic for coalescing RS bits"), >> but ultimately was not functionally changed. > > I've tried to understand the above, but without definitions of WB > and RS and the '4-ness' of something it is quite difficult. > > I THINK this is all about telling the hardware there is a packet > in the ring to transmit? > > I don't understand the 4-ness. Linux requires that the hardware > be notified of a single packet transmit, and that the 'transmit > complete' also be notified in 'a timely manner' for a single packet. So to clarify some of this. The RS is short for Report Status. It tells the hardware that when it has completed the descriptor it should trigger a write back, aka WB. The 4-ness is because each descriptor is 16 bytes, so if we write back once every 4 descriptors that is one 64B cache line which is much better for most platforms performance wise. > skb->xmit_more ought to be usable to optimise both of these > (assuming it isn't a lie). Actually it leads to issues if we hold off for too long. If we are busy dequeueing a long list, and the Tx queue has gone empty then we are stalling Tx without need to do so. We need to be regularly notifying the device that it has buffers available to transmit which is what xmit_more is good for. However in addition the hardware needs to notify us regularly that there are buffers ready to clean which it isn't necessarily so useful for, especially if are filling the entire ring and only providing one notification to the driver that there are buffers ready since we can defer the Tx interrupt for too long. What Jake is trying to fix here is to prevent us from generating what looks like a square wave in terms of throughput. Essentially the current behavior that is being seen is that all the buffers in the ring either belong to software, or belong to hardware. It is best for us to have this split half and half so that the hardware has buffers to transmit and software has buffers to clean and enqueue new buffers onto. > The driver would need to ensure a ring full of messages isn't > generated without either wakeup - but that might be 128 frames. That is what is happening here. Basically we are guaranteeing writebacks are occurring on a regular basis instead of in large bursts. > FWIW on the systems I have PCIe writes are relatively cheap > (reads are slow). So different counts would be appropriate > for delaying doorbell writes and requesting completion interrupts. > > David The doorbell writes are still being delayed the same amount with this patch. The only thing that was split out was the device write backs are now occurring on a more regular basis instead of being held off until a much larger set is handled. - Alex
> -----Original Message----- > From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] > On Behalf Of Alexander Duyck > Sent: Monday, October 09, 2017 9:28 AM > To: David Laight <David.Laight@aculab.com> > Cc: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; davem@davemloft.net; > Keller, Jacob E <jacob.e.keller@intel.com>; netdev@vger.kernel.org; > nhorman@redhat.com; sassmann@redhat.com; jogreene@redhat.com > Subject: Re: [net-next 14/15] i40e: ignore skb->xmit_more when deciding to set > RS bit > > On Mon, Oct 9, 2017 at 2:07 AM, David Laight <David.Laight@aculab.com> wrote: > > From: Jeff Kirsher > >> Sent: 06 October 2017 18:57 > >> From: Jacob Keller <jacob.e.keller@intel.com> > >> > >> Since commit 6a7fded776a7 ("i40e: Fix RS bit update in Tx path and > >> disable force WB workaround") we've tried to "optimize" setting the > >> RS bit based around skb->xmit_more. This same logic was refactored > >> in commit 1dc8b538795f ("i40e: Reorder logic for coalescing RS bits"), > >> but ultimately was not functionally changed. > > > > I've tried to understand the above, but without definitions of WB > > and RS and the '4-ness' of something it is quite difficult. > > > > I THINK this is all about telling the hardware there is a packet > > in the ring to transmit? > > > > I don't understand the 4-ness. Linux requires that the hardware > > be notified of a single packet transmit, and that the 'transmit > > complete' also be notified in 'a timely manner' for a single packet. > > So to clarify some of this. The RS is short for Report Status. It > tells the hardware that when it has completed the descriptor it should > trigger a write back, aka WB. > > The 4-ness is because each descriptor is 16 bytes, so if we write back > once every 4 descriptors that is one 64B cache line which is much > better for most platforms performance wise. > > > skb->xmit_more ought to be usable to optimise both of these > > (assuming it isn't a lie). > > Actually it leads to issues if we hold off for too long. If we are > busy dequeueing a long list, and the Tx queue has gone empty then we > are stalling Tx without need to do so. We need to be regularly > notifying the device that it has buffers available to transmit which > is what xmit_more is good for. However in addition the hardware needs > to notify us regularly that there are buffers ready to clean which it > isn't necessarily so useful for, especially if are filling the entire > ring and only providing one notification to the driver that there are > buffers ready since we can defer the Tx interrupt for too long. > > What Jake is trying to fix here is to prevent us from generating what > looks like a square wave in terms of throughput. Essentially the > current behavior that is being seen is that all the buffers in the > ring either belong to software, or belong to hardware. It is best for > us to have this split half and half so that the hardware has buffers > to transmit and software has buffers to clean and enqueue new buffers > onto. > Yes this sums it up pretty well. The test case which found this bug is outlined in the description. Essentially we could see up to dozens or even almost a hundred packets in a row bunched up if we had a bunch of virtual devices layered on top of the network driver. This led to us not indicating to get status from the hardware about finished packets, which results in an increased delay to finish Tx. > > The driver would need to ensure a ring full of messages isn't > > generated without either wakeup - but that might be 128 frames. > > That is what is happening here. Basically we are guaranteeing > writebacks are occurring on a regular basis instead of in large > bursts. > > > FWIW on the systems I have PCIe writes are relatively cheap > > (reads are slow). So different counts would be appropriate > > for delaying doorbell writes and requesting completion interrupts. > > > > David > > The doorbell writes are still being delayed the same amount with this > patch. The only thing that was split out was the device write backs > are now occurring on a more regular basis instead of being held off > until a much larger set is handled. > Exactly. The tail bumps are still functioning the same, but it's the hardware report status requests which are not delayed. Note that there's some further quirks in how our hardware performs write backs which exacerbate the problem because of how the cachelines are setup so that the delay is even worse than we would expect. Thanks, Jake > - Alex
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index d9fdf69bbc6e..3bd176606c09 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -3167,38 +3167,12 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb, /* write last descriptor with EOP bit */ td_cmd |= I40E_TX_DESC_CMD_EOP; - /* We can OR these values together as they both are checked against - * 4 below and at this point desc_count will be used as a boolean value - * after this if/else block. + /* We OR these values together to check both against 4 (WB_STRIDE) + * below. This is safe since we don't re-use desc_count afterwards. */ desc_count |= ++tx_ring->packet_stride; - /* Algorithm to optimize tail and RS bit setting: - * if queue is stopped - * mark RS bit - * reset packet counter - * else if xmit_more is supported and is true - * advance packet counter to 4 - * reset desc_count to 0 - * - * if desc_count >= 4 - * mark RS bit - * reset packet counter - * if desc_count > 0 - * update tail - * - * Note: If there are less than 4 descriptors - * pending and interrupts were disabled the service task will - * trigger a force WB. - */ - if (netif_xmit_stopped(txring_txq(tx_ring))) { - goto do_rs; - } else if (skb->xmit_more) { - /* set stride to arm on next packet and reset desc_count */ - tx_ring->packet_stride = WB_STRIDE; - desc_count = 0; - } else if (desc_count >= WB_STRIDE) { -do_rs: + if (desc_count >= WB_STRIDE) { /* write last descriptor with RS bit set */ td_cmd |= I40E_TX_DESC_CMD_RS; tx_ring->packet_stride = 0; @@ -3219,7 +3193,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb, first->next_to_watch = tx_desc; /* notify HW of packet */ - if (desc_count) { + if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) { writel(i, tx_ring->tail); /* we need this if more than one processor can write to our tail