diff mbox

Throughput regression with `tcp: refine TSO autosizing`

Message ID CA+BoTQ=u_xPuqTVOVaFTQNRrJ+UTXe89SY+=+7Y1LpxxrkRDfg@mail.gmail.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Michal Kazior Feb. 10, 2015, 10:33 a.m. UTC
On 9 February 2015 at 16:11, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2015-02-09 at 14:47 +0100, Michal Kazior wrote:
[...]
> This is not what I suggested.
>
> If you test this on any other network device, you'll have
> sk->sk_tx_completion_delay_us == 0
>
> amount = 0 * (sk->sk_pacing_rate >> 10); --> 0
> limit = max(2 * skb->truesize, amount >> 10); --> 2 * skb->truesize

You're right. Sorry for mixing up.


> So non TSO/GSO NIC will not be able to queue more than 2 MSS (one MSS
> per skb)
>
> Then if you store only the last tx completion, you have the possibility
> of having a last packet of a train (say a retransmit) to make it very
> low.
>
> Ideally the formula would be in TCP something very fast to compute :
>
> amount = (sk->sk_pacing_rate >> 10) + sk->tx_completion_delay_cushion;
> limit = max(2 * skb->truesize, amount);
> limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>
> So a 'problematic' driver would have to do the math (64 bit maths) like
> this :
>
>
> sk->tx_completion_delay_cushion = ewma_tx_delay * sk->sk_pacing_rate;

Hmm. So I've done like you suggested (hopefully I didn't mix anything
up this time around).

I now get pre-regression performance, ~250mbps on 1 flow, 600mbps on 5
flows (vs 250mbps whatever number of flows).


Michał


                BUG_ON(!tso_segs);
@@ -2053,7 +2054,9 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
                 * of queued bytes to ensure line rate.
                 * One example is wifi aggregation (802.11 AMPDU)
                 */
-               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+               amount = (sk->sk_pacing_rate >> 10) +
+                        sk->sk_tx_completion_delay_cushion;
+               limit = max(2 * skb->truesize, amount);
                limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

                if (atomic_read(&sk->sk_wmem_alloc) > limit) {
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet Feb. 10, 2015, 12:54 p.m. UTC | #1
On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:

> +       if (msdu->sk) {
> +               ewma_add(&ar->tx_delay_us,
> +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
> +                        NSEC_PER_USEC);
> +
> +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
> +                               (ewma_read(&ar->tx_delay_us) *
> +                                msdu->sk->sk_pacing_rate) >> 20;
> +       }
> +

Hi Michal

This is almost it ;)

As I said you must do this using u64 arithmetics, we still support 32bit
kernels.

Also, >> 20 instead of / 1000000 introduces a 5% error, I would use a
plain divide, as the compiler will use a reciprocal divide (ie : a
multiply)

We use >> 10 instead of /1000 because a 2.4 % error is probably okay.

        ewma_add(&ar->tx_delay_us,
                 ktime_to_ns(ktime_sub(ktime_get(),
skb_cb->stamp)) /
	                        NSEC_PER_USEC);
	u64 val = (u64)ewma_read(&ar->tx_delay_us) *
                   msdu->sk->sk_pacing_rate;

	do_div(val, USEC_PER_SEC);
		
        ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
                    (u32)val;
                 
(WRITE_ONCE() would be better for new kernels, but ACCESS_ONCE() is ok
since we probably want to backport to stable kernels)


Thanks


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 10, 2015, 1:05 p.m. UTC | #2
On Tue, 2015-02-10 at 04:54 -0800, Eric Dumazet wrote:

> Hi Michal
> 
> This is almost it ;)
> 
> As I said you must do this using u64 arithmetics, we still support 32bit
> kernels.
> 
> Also, >> 20 instead of / 1000000 introduces a 5% error, I would use a
> plain divide, as the compiler will use a reciprocal divide (ie : a
> multiply)
> 
> We use >> 10 instead of /1000 because a 2.4 % error is probably okay.
> 
>         ewma_add(&ar->tx_delay_us,
>                  ktime_to_ns(ktime_sub(ktime_get(),
> skb_cb->stamp)) /
> 	                        NSEC_PER_USEC);

btw I suspect this wont compile on 32 bit kernel

You need to use do_div() as well :

u64 val = ktime_to_ns(ktime_sub(ktime_get(),
				skb_cb->stamp));

do_div(val, NSEC_PER_USEC);

ewma_add(&ar->tx_delay_us, val);


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 10, 2015, 1:14 p.m. UTC | #3
On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>                            ath10k_core_napi_dummy_poll, 64);
> +       ewma_init(&ar->tx_delay_us, 16384, 8);


1) 16384 factor might be too big.

2) a weight of 8 seems too low given aggregation values used in wifi.

On 32bit arches, the max range for ewma value would be 262144 usec,
a quarter of a second...

You could use a factor of 64 instead, and a weight of 16.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Feb. 10, 2015, 2:19 p.m. UTC | #4
On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:

> +       if (msdu->sk) {
> +               ewma_add(&ar->tx_delay_us,
> +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
> +                        NSEC_PER_USEC);
> +
> +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
> +                               (ewma_read(&ar->tx_delay_us) *
> +                                msdu->sk->sk_pacing_rate) >> 20;
> +       }

To some extent, every wifi driver is going to have this problem. Perhaps
we should do this in mac80211?

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 10, 2015, 3:09 p.m. UTC | #5
On Tue, 2015-02-10 at 15:19 +0100, Johannes Berg wrote:
> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
> 
> > +       if (msdu->sk) {
> > +               ewma_add(&ar->tx_delay_us,
> > +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
> > +                        NSEC_PER_USEC);
> > +
> > +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
> > +                               (ewma_read(&ar->tx_delay_us) *
> > +                                msdu->sk->sk_pacing_rate) >> 20;
> > +       }
> 
> To some extent, every wifi driver is going to have this problem. Perhaps
> we should do this in mac80211?

I'll provide the TCP patch.

sk->sk_tx_completion_delay_cushion is probably a wrong name, as the
units here are in bytes, since it is really number of bytes in the
network driver that accommodate for tx completions delays. 

tx_completion_delay * pacing_rate

sk_tx_completion_cushion maybe.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Kazior Feb. 11, 2015, 8:33 a.m. UTC | #6
On 10 February 2015 at 14:14, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>>                            ath10k_core_napi_dummy_poll, 64);
>> +       ewma_init(&ar->tx_delay_us, 16384, 8);
>
>
> 1) 16384 factor might be too big.
>
> 2) a weight of 8 seems too low given aggregation values used in wifi.
>
> On 32bit arches, the max range for ewma value would be 262144 usec,
> a quarter of a second...
>
> You could use a factor of 64 instead, and a weight of 16.

64/16 seems to work fine as well.

On a related note: I still wonder how to get single TCP flow to reach
line rate with ath10k (it still doesn't; I reach line rate with
multiple flows only). Isn't the tcp_limit_output_bytes just too small
for devices like Wi-Fi where you can send aggregates of even 64*3*1500
bytes long in a single shot and you can't expect even a single
tx-completion of it to come in before its transmitted entirely? You
effectively operate with bursts of traffic.

Some numbers:
 ath10k w/o cushion w/o aggregation 1 flow: UDP 65mbps, TCP 30mbps
 ath10k w/ cushion w/o aggregation 1 flow: UDP 65mbps, TCP 59mbps
 ath10k w/o cushion w/ aggregation 1 flow: UDP 650mbps, TCP 250mbps
 ath10k w/ cushion w/ aggregation 1 flow: UDP 650mbps, TCP 250mbps
 ath10k w/o cushion w/ aggregation 5 flows: UDP 650mbps, TCP 250mbps
 ath10k w/ cushion w/ aggregation 5 flows: UDP 650mbps, TCP 600mbps

"w/o aggregation" means forcing ath10k to use 1 A-MSDU and 1 A-MPDU
per aggregate so latencies due to aggregation itself should be pretty
much nil.

If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
w/ aggregation to reach 600mbps on a single flow.


Michał
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Kazior Feb. 11, 2015, 8:57 a.m. UTC | #7
On 10 February 2015 at 15:19, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>
>> +       if (msdu->sk) {
>> +               ewma_add(&ar->tx_delay_us,
>> +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
>> +                        NSEC_PER_USEC);
>> +
>> +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
>> +                               (ewma_read(&ar->tx_delay_us) *
>> +                                msdu->sk->sk_pacing_rate) >> 20;
>> +       }
>
> To some extent, every wifi driver is going to have this problem. Perhaps
> we should do this in mac80211?

Good point. I was actually thinking about it. I can try cooking a
patch unless you want to do it yourself :-)


Michał
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 11, 2015, 1:17 p.m. UTC | #8
On Wed, 2015-02-11 at 09:33 +0100, Michal Kazior wrote:

> If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
> w/ aggregation to reach 600mbps on a single flow.

You know, there is a reason this sysctl exists in the first place ;)

The first suggestion I made to you was to raise it.

The default setting must stay as is as long default Qdisc is pfifo_fast.

I believe I already mentioned skb->truesize tricks for drivers willing
to adjust the TSQ given their constraints.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Kazior Feb. 12, 2015, 7:16 a.m. UTC | #9
On 11 February 2015 at 14:17, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2015-02-11 at 09:33 +0100, Michal Kazior wrote:
>
>> If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
>> w/ aggregation to reach 600mbps on a single flow.
>
> You know, there is a reason this sysctl exists in the first place ;)
>
> The first suggestion I made to you was to raise it.
>
> The default setting must stay as is as long default Qdisc is pfifo_fast.
>
> I believe I already mentioned skb->truesize tricks for drivers willing
> to adjust the TSQ given their constraints.

Right. truesize didn't help in my early tests and once the cushion
thing came about I had assumed that it's not relevant anymore.

I just checked:

@@ -2620,6 +2621,12 @@ static void ath10k_tx(struct ieee80211_hw *hw,
        if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
                ath10k_dbg(ar, ATH10K_DBG_MAC,
"IEEE80211_TX_CTL_NO_CCK_RATE\n");

+       if (skb->sk) {
+               u32 trim = skb->truesize - (skb->truesize / 8);
+               skb->truesize -= trim;
+               atomic_sub(trim, &skb->sk->sk_wmem_alloc);
+       }

With this I get 600mbps on a single flow. The /2 wasn't enough (it
barely made a difference, 250->300mbps). The question is how do I know
how much of trimming is too much? Could the tx completion delay be
used to compute the trim factor, hmm..

Maybe this should be done in mac80211 as well?


Michał
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Kazior Feb. 12, 2015, 7:48 a.m. UTC | #10
On 11 February 2015 at 09:57, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 10 February 2015 at 15:19, Johannes Berg <johannes@sipsolutions.net> wrote:
>> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>>
>>> +       if (msdu->sk) {
>>> +               ewma_add(&ar->tx_delay_us,
>>> +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
>>> +                        NSEC_PER_USEC);
>>> +
>>> +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
>>> +                               (ewma_read(&ar->tx_delay_us) *
>>> +                                msdu->sk->sk_pacing_rate) >> 20;
>>> +       }
>>
>> To some extent, every wifi driver is going to have this problem. Perhaps
>> we should do this in mac80211?
>
> Good point. I was actually thinking about it. I can try cooking a
> patch unless you want to do it yourself :-)

I've taken a look into this. The most obvious place to add the
timestamp for each packet would be ieee80211_tx_info (i.e. the
skb->cb[48]). The problem is it's very tight there. Even squeezing 2
bytes (allowing up to 64ms of tx completion delay which I'm worried
won't be enough) will be troublesome. Some drivers already use every
last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
40 bytes of driver_data).

I wonder if it's okay to bump skb->cb to 56 bytes to avoid the cascade
of changes required to implement the tx completion delay accounting?


Michał
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Taht Feb. 12, 2015, 8:33 a.m. UTC | #11
On Wed, Feb 11, 2015 at 11:48 PM, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 11 February 2015 at 09:57, Michal Kazior <michal.kazior@tieto.com> wrote:
>> On 10 February 2015 at 15:19, Johannes Berg <johannes@sipsolutions.net> wrote:
>>> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>>>
>>>> +       if (msdu->sk) {
>>>> +               ewma_add(&ar->tx_delay_us,
>>>> +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
>>>> +                        NSEC_PER_USEC);
>>>> +
>>>> +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
>>>> +                               (ewma_read(&ar->tx_delay_us) *
>>>> +                                msdu->sk->sk_pacing_rate) >> 20;
>>>> +       }
>>>
>>> To some extent, every wifi driver is going to have this problem. Perhaps
>>> we should do this in mac80211?
>>
>> Good point. I was actually thinking about it. I can try cooking a
>> patch unless you want to do it yourself :-)
>
> I've taken a look into this. The most obvious place to add the
> timestamp for each packet would be ieee80211_tx_info (i.e. the
> skb->cb[48]). The problem is it's very tight there. Even squeezing 2
> bytes (allowing up to 64ms of tx completion delay which I'm worried

I will argue strongly in favor of never allowing more than 4ms packets
to accumulate in the firmware.

> won't be enough) will be troublesome. Some drivers already use every
> last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
> 40 bytes of driver_data).
>
> I wonder if it's okay to bump skb->cb to 56 bytes to avoid the cascade
> of changes required to implement the tx completion delay accounting?
>
>
> Michał
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Feb. 24, 2015, 10:24 a.m. UTC | #12
On Thu, 2015-02-12 at 08:48 +0100, Michal Kazior wrote:

> > Good point. I was actually thinking about it. I can try cooking a
> > patch unless you want to do it yourself :-)
> 
> I've taken a look into this. The most obvious place to add the
> timestamp for each packet would be ieee80211_tx_info (i.e. the
> skb->cb[48]). The problem is it's very tight there. Even squeezing 2
> bytes (allowing up to 64ms of tx completion delay which I'm worried
> won't be enough) will be troublesome. Some drivers already use every
> last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
> 40 bytes of driver_data).

Couldn't we just repurpose the existing skb->tstamp field for this, as
long as the skb is fully contained within the wireless layer?

Actually, it looks like we can't, since I guess timestamping options can
be turned on on any socket.

> I wonder if it's okay to bump skb->cb to 56 bytes to avoid the cascade
> of changes required to implement the tx completion delay accounting?

I have no doubt that would be rejected :)

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Feb. 24, 2015, 10:30 a.m. UTC | #13
On Tue, 2015-02-24 at 11:24 +0100, Johannes Berg wrote:
> On Thu, 2015-02-12 at 08:48 +0100, Michal Kazior wrote:
> 
> > > Good point. I was actually thinking about it. I can try cooking a
> > > patch unless you want to do it yourself :-)
> > 
> > I've taken a look into this. The most obvious place to add the
> > timestamp for each packet would be ieee80211_tx_info (i.e. the
> > skb->cb[48]). The problem is it's very tight there. Even squeezing 2
> > bytes (allowing up to 64ms of tx completion delay which I'm worried
> > won't be enough) will be troublesome. Some drivers already use every
> > last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
> > 40 bytes of driver_data).
> 
> Couldn't we just repurpose the existing skb->tstamp field for this, as
> long as the skb is fully contained within the wireless layer?
> 
> Actually, it looks like we can't, since I guess timestamping options can
> be turned on on any socket.

Actually, that creates a clone or a new skb? Hmm.

Anyway, I think we should just move the vif and hw_key pointers out of
the tx_info and into ieee80211_tx_control. That will give us plenty of
space, and they can't really be used long-term anyway? Although ... both
might be somewhat problematic for mac80211 itself? Hmm.

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Feb. 24, 2015, 10:59 a.m. UTC | #14
On Tue, 2015-02-24 at 11:30 +0100, Johannes Berg wrote:
> On Tue, 2015-02-24 at 11:24 +0100, Johannes Berg wrote:
> > On Thu, 2015-02-12 at 08:48 +0100, Michal Kazior wrote:
> > 
> > > > Good point. I was actually thinking about it. I can try cooking a
> > > > patch unless you want to do it yourself :-)
> > > 
> > > I've taken a look into this. The most obvious place to add the
> > > timestamp for each packet would be ieee80211_tx_info (i.e. the
> > > skb->cb[48]). The problem is it's very tight there. Even squeezing 2
> > > bytes (allowing up to 64ms of tx completion delay which I'm worried
> > > won't be enough) will be troublesome. Some drivers already use every
> > > last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
> > > 40 bytes of driver_data).
> > 
> > Couldn't we just repurpose the existing skb->tstamp field for this, as
> > long as the skb is fully contained within the wireless layer?
> > 
> > Actually, it looks like we can't, since I guess timestamping options can
> > be turned on on any socket.
> 
> Actually, that creates a clone or a new skb? Hmm.

Ah and then it puts it on the error queue right away, so I think we can
reuse it.

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/wireless/ath/ath10k/core.c
b/drivers/net/wireless/ath/ath10k/core.c
index 367e896..a29111c 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -18,6 +18,7 @@ 
 #include <linux/module.h>
 #include <linux/firmware.h>
 #include <linux/of.h>
+#include <linux/average.h>

 #include "core.h"
 #include "mac.h"
@@ -1423,6 +1424,7 @@  struct ath10k *ath10k_core_create(size_t
priv_size, struct device *dev,
        init_dummy_netdev(&ar->napi_dev);
        ieee80211_napi_add(ar->hw, &ar->napi, &ar->napi_dev,
                           ath10k_core_napi_dummy_poll, 64);
+       ewma_init(&ar->tx_delay_us, 16384, 8);

        ret = ath10k_debug_create(ar);
        if (ret)
diff --git a/drivers/net/wireless/ath/ath10k/core.h
b/drivers/net/wireless/ath/ath10k/core.h
index 3be3a59..34f6d78 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -24,6 +24,7 @@ 
 #include <linux/pci.h>
 #include <linux/uuid.h>
 #include <linux/time.h>
+#include <linux/average.h>

 #include "htt.h"
 #include "htc.h"
@@ -82,6 +83,7 @@  struct ath10k_skb_cb {
        dma_addr_t paddr;
        u8 eid;
        u8 vdev_id;
+       ktime_t stamp;

        struct {
                u8 tid;
@@ -625,6 +627,7 @@  struct ath10k {

        struct net_device napi_dev;
        struct napi_struct napi;
+       struct ewma tx_delay_us;

 #ifdef CONFIG_ATH10K_DEBUGFS
        struct ath10k_debug debug;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c
b/drivers/net/wireless/ath/ath10k/mac.c
index 15e47f4..5efb2a7 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2620,6 +2620,7 @@  static void ath10k_tx(struct ieee80211_hw *hw,
        if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
                ath10k_dbg(ar, ATH10K_DBG_MAC,
"IEEE80211_TX_CTL_NO_CCK_RATE\n");

+       ATH10K_SKB_CB(skb)->stamp = ktime_get();
        ATH10K_SKB_CB(skb)->htt.is_offchan = false;
        ATH10K_SKB_CB(skb)->htt.tid = ath10k_tx_h_get_tid(hdr);
        ATH10K_SKB_CB(skb)->vdev_id = ath10k_tx_h_get_vdev_id(ar, vif);
diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
b/drivers/net/wireless/ath/ath10k/txrx.c
index 3f00cec..0f5f0f2 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -15,6 +15,8 @@ 
  * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
  */

+#include <net/sock.h>
+#include <linux/average.h>
 #include "core.h"
 #include "txrx.h"
 #include "htt.h"
@@ -82,6 +84,16 @@  void ath10k_txrx_tx_unref(struct ath10k_htt *htt,

        ath10k_report_offchan_tx(htt->ar, msdu);

+       if (msdu->sk) {
+               ewma_add(&ar->tx_delay_us,
+                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
+                        NSEC_PER_USEC);
+
+               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
+                               (ewma_read(&ar->tx_delay_us) *
+                                msdu->sk->sk_pacing_rate) >> 20;
+       }
+
        info = IEEE80211_SKB_CB(msdu);
        memset(&info->status, 0, sizeof(info->status));
        trace_ath10k_txrx_tx_unref(ar, tx_done->msdu_id);
diff --git a/include/net/sock.h b/include/net/sock.h
index 2210fec..6772543 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -391,6 +391,7 @@  struct sock {
        gfp_t                   sk_allocation;
        u32                     sk_pacing_rate; /* bytes per second */
        u32                     sk_max_pacing_rate;
+       u32                     sk_tx_completion_delay_cushion;
        netdev_features_t       sk_route_caps;
        netdev_features_t       sk_route_nocaps;
        int                     sk_gso_type;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b..526a568 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1996,6 +1996,7 @@  static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
        max_segs = tcp_tso_autosize(sk, mss_now);
        while ((skb = tcp_send_head(sk))) {
                unsigned int limit;
+               unsigned int amount;

                tso_segs = tcp_init_tso_segs(sk, skb, mss_now);