diff mbox series

[net-next,14/15] i40e: ignore skb->xmit_more when deciding to set RS bit

Message ID 20171006175727.868-15-jeffrey.t.kirsher@intel.com
State Accepted, archived
Delegated to: David Miller
Headers show
Series 40GbE Intel Wired LAN Driver Updates 2017-10-06 | expand

Commit Message

Kirsher, Jeffrey T Oct. 6, 2017, 5:57 p.m. UTC
From: Jacob Keller <jacob.e.keller@intel.com>

Since commit 6a7fded776a7 ("i40e: Fix RS bit update in Tx path and
disable force WB workaround") we've tried to "optimize" setting the
RS bit based around skb->xmit_more. This same logic was refactored
in commit 1dc8b538795f ("i40e: Reorder logic for coalescing RS bits"),
but ultimately was not functionally changed.

Using skb->xmit_more in this way is incorrect, because in certain
circumstances we may see a large number of skbs in sequence with
xmit_more set. This leads to a performance loss as the hardware does not
writeback anything for those packets, which delays the time it takes for
us to respond to the stack transmit requests. This significantly impacts
UDP performance, especially when layered with multiple devices, such as
bonding, VLANs, and vnet setups.

This was not noticed until now because it is difficult to create a setup
which reproduces the issue. It was discovered in a UDP_STREAM test in
a VM, connected using a vnet device to a bridge, which is connected to
a bonded pair of X710 ports in active-backup mode with a VLAN. These
layered devices seem to compound the number of skbs transmitted at once
by the qdisc. Additionally, the problem can be masked by reducing the
ITR value.

Since the original commit does not provide strong justification for this
RS bit "optimization", revert to the previous behavior of setting the RS
bit every 4th packet.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 34 ++++-------------------------
 1 file changed, 4 insertions(+), 30 deletions(-)

Comments

David Laight Oct. 9, 2017, 9:07 a.m. UTC | #1
From: Jeff Kirsher
> Sent: 06 October 2017 18:57
> From: Jacob Keller <jacob.e.keller@intel.com>
> 
> Since commit 6a7fded776a7 ("i40e: Fix RS bit update in Tx path and
> disable force WB workaround") we've tried to "optimize" setting the
> RS bit based around skb->xmit_more. This same logic was refactored
> in commit 1dc8b538795f ("i40e: Reorder logic for coalescing RS bits"),
> but ultimately was not functionally changed.

I've tried to understand the above, but without definitions of WB
and RS and the '4-ness' of something it is quite difficult.

I THINK this is all about telling the hardware there is a packet
in the ring to transmit?

I don't understand the 4-ness. Linux requires that the hardware
be notified of a single packet transmit, and that the 'transmit
complete' also be notified in 'a timely manner' for a single packet.

skb->xmit_more ought to be usable to optimise both of these
(assuming it isn't a lie).
The driver would need to ensure a ring full of messages isn't
generated without either wakeup - but that might be 128 frames.

FWIW on the systems I have PCIe writes are relatively cheap
(reads are slow). So different counts would be appropriate
for delaying doorbell writes and requesting completion interrupts.

	David
Alexander H Duyck Oct. 9, 2017, 4:28 p.m. UTC | #2
On Mon, Oct 9, 2017 at 2:07 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Jeff Kirsher
>> Sent: 06 October 2017 18:57
>> From: Jacob Keller <jacob.e.keller@intel.com>
>>
>> Since commit 6a7fded776a7 ("i40e: Fix RS bit update in Tx path and
>> disable force WB workaround") we've tried to "optimize" setting the
>> RS bit based around skb->xmit_more. This same logic was refactored
>> in commit 1dc8b538795f ("i40e: Reorder logic for coalescing RS bits"),
>> but ultimately was not functionally changed.
>
> I've tried to understand the above, but without definitions of WB
> and RS and the '4-ness' of something it is quite difficult.
>
> I THINK this is all about telling the hardware there is a packet
> in the ring to transmit?
>
> I don't understand the 4-ness. Linux requires that the hardware
> be notified of a single packet transmit, and that the 'transmit
> complete' also be notified in 'a timely manner' for a single packet.

So to clarify some of this. The RS is short for Report Status. It
tells the hardware that when it has completed the descriptor it should
trigger a write back, aka WB.

The 4-ness is because each descriptor is 16 bytes, so if we write back
once every 4 descriptors that is one 64B cache line which is much
better for most platforms performance wise.

> skb->xmit_more ought to be usable to optimise both of these
> (assuming it isn't a lie).

Actually it leads to issues if we hold off for too long. If we are
busy dequeueing a long list, and the Tx queue has gone empty then we
are stalling Tx without need to do so. We need to be regularly
notifying the device that it has buffers available to transmit which
is what xmit_more is good for. However in addition the hardware needs
to notify us regularly that there are buffers ready to clean which it
isn't necessarily so useful for, especially if  are filling the entire
ring and only providing one notification to the driver that there are
buffers ready since we can defer the Tx interrupt for too long.

What Jake is trying to fix here is to prevent us from generating what
looks like a square wave in terms of throughput. Essentially the
current behavior that is being seen is that all the buffers in the
ring either belong to software, or belong to hardware. It is best for
us to have this split half and half so that the hardware has buffers
to transmit and software has buffers to clean and enqueue new buffers
onto.

> The driver would need to ensure a ring full of messages isn't
> generated without either wakeup - but that might be 128 frames.

That is what is happening here. Basically we are guaranteeing
writebacks are occurring on a regular basis instead of in large
bursts.

> FWIW on the systems I have PCIe writes are relatively cheap
> (reads are slow). So different counts would be appropriate
> for delaying doorbell writes and requesting completion interrupts.
>
>         David

The doorbell writes are still being delayed the same amount with this
patch. The only thing that was split out was the device write backs
are now occurring on a more regular basis instead of being held off
until a much larger set is handled.

- Alex
Jacob Keller Oct. 9, 2017, 8:10 p.m. UTC | #3
> -----Original Message-----

> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]

> On Behalf Of Alexander Duyck

> Sent: Monday, October 09, 2017 9:28 AM

> To: David Laight <David.Laight@aculab.com>

> Cc: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; davem@davemloft.net;

> Keller, Jacob E <jacob.e.keller@intel.com>; netdev@vger.kernel.org;

> nhorman@redhat.com; sassmann@redhat.com; jogreene@redhat.com

> Subject: Re: [net-next 14/15] i40e: ignore skb->xmit_more when deciding to set

> RS bit

> 

> On Mon, Oct 9, 2017 at 2:07 AM, David Laight <David.Laight@aculab.com> wrote:

> > From: Jeff Kirsher

> >> Sent: 06 October 2017 18:57

> >> From: Jacob Keller <jacob.e.keller@intel.com>

> >>

> >> Since commit 6a7fded776a7 ("i40e: Fix RS bit update in Tx path and

> >> disable force WB workaround") we've tried to "optimize" setting the

> >> RS bit based around skb->xmit_more. This same logic was refactored

> >> in commit 1dc8b538795f ("i40e: Reorder logic for coalescing RS bits"),

> >> but ultimately was not functionally changed.

> >

> > I've tried to understand the above, but without definitions of WB

> > and RS and the '4-ness' of something it is quite difficult.

> >

> > I THINK this is all about telling the hardware there is a packet

> > in the ring to transmit?

> >

> > I don't understand the 4-ness. Linux requires that the hardware

> > be notified of a single packet transmit, and that the 'transmit

> > complete' also be notified in 'a timely manner' for a single packet.

> 

> So to clarify some of this. The RS is short for Report Status. It

> tells the hardware that when it has completed the descriptor it should

> trigger a write back, aka WB.

> 

> The 4-ness is because each descriptor is 16 bytes, so if we write back

> once every 4 descriptors that is one 64B cache line which is much

> better for most platforms performance wise.

> 

> > skb->xmit_more ought to be usable to optimise both of these

> > (assuming it isn't a lie).

> 

> Actually it leads to issues if we hold off for too long. If we are

> busy dequeueing a long list, and the Tx queue has gone empty then we

> are stalling Tx without need to do so. We need to be regularly

> notifying the device that it has buffers available to transmit which

> is what xmit_more is good for. However in addition the hardware needs

> to notify us regularly that there are buffers ready to clean which it

> isn't necessarily so useful for, especially if  are filling the entire

> ring and only providing one notification to the driver that there are

> buffers ready since we can defer the Tx interrupt for too long.

> 

> What Jake is trying to fix here is to prevent us from generating what

> looks like a square wave in terms of throughput. Essentially the

> current behavior that is being seen is that all the buffers in the

> ring either belong to software, or belong to hardware. It is best for

> us to have this split half and half so that the hardware has buffers

> to transmit and software has buffers to clean and enqueue new buffers

> onto.

> 


Yes this sums it up pretty well. The test case which found this bug is outlined in the description. Essentially we could see up to dozens or even almost a hundred packets in a row bunched up if we had a bunch of virtual devices layered on top of the network driver. This led to us not indicating to get status from the hardware about finished packets, which results in an increased delay to finish Tx.

> > The driver would need to ensure a ring full of messages isn't

> > generated without either wakeup - but that might be 128 frames.

> 

> That is what is happening here. Basically we are guaranteeing

> writebacks are occurring on a regular basis instead of in large

> bursts.

> 

> > FWIW on the systems I have PCIe writes are relatively cheap

> > (reads are slow). So different counts would be appropriate

> > for delaying doorbell writes and requesting completion interrupts.

> >

> >         David

> 

> The doorbell writes are still being delayed the same amount with this

> patch. The only thing that was split out was the device write backs

> are now occurring on a more regular basis instead of being held off

> until a much larger set is handled.

> 


Exactly. The tail bumps are still functioning the same, but it's the hardware report status requests which are not delayed. Note that there's some further quirks in how our hardware performs write backs which exacerbate the problem because of how the cachelines are setup so that the delay is even worse than we would expect.

Thanks,
Jake

> - Alex
diff mbox series

Patch

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index d9fdf69bbc6e..3bd176606c09 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -3167,38 +3167,12 @@  static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 	/* write last descriptor with EOP bit */
 	td_cmd |= I40E_TX_DESC_CMD_EOP;
 
-	/* We can OR these values together as they both are checked against
-	 * 4 below and at this point desc_count will be used as a boolean value
-	 * after this if/else block.
+	/* We OR these values together to check both against 4 (WB_STRIDE)
+	 * below. This is safe since we don't re-use desc_count afterwards.
 	 */
 	desc_count |= ++tx_ring->packet_stride;
 
-	/* Algorithm to optimize tail and RS bit setting:
-	 * if queue is stopped
-	 *	mark RS bit
-	 *	reset packet counter
-	 * else if xmit_more is supported and is true
-	 *	advance packet counter to 4
-	 *	reset desc_count to 0
-	 *
-	 * if desc_count >= 4
-	 *	mark RS bit
-	 *	reset packet counter
-	 * if desc_count > 0
-	 *	update tail
-	 *
-	 * Note: If there are less than 4 descriptors
-	 * pending and interrupts were disabled the service task will
-	 * trigger a force WB.
-	 */
-	if (netif_xmit_stopped(txring_txq(tx_ring))) {
-		goto do_rs;
-	} else if (skb->xmit_more) {
-		/* set stride to arm on next packet and reset desc_count */
-		tx_ring->packet_stride = WB_STRIDE;
-		desc_count = 0;
-	} else if (desc_count >= WB_STRIDE) {
-do_rs:
+	if (desc_count >= WB_STRIDE) {
 		/* write last descriptor with RS bit set */
 		td_cmd |= I40E_TX_DESC_CMD_RS;
 		tx_ring->packet_stride = 0;
@@ -3219,7 +3193,7 @@  static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 	first->next_to_watch = tx_desc;
 
 	/* notify HW of packet */
-	if (desc_count) {
+	if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
 		writel(i, tx_ring->tail);
 
 		/* we need this if more than one processor can write to our tail