
[v2,net-next] tcp: avoid expensive pskb_expand_head() calls

Message ID 1334778707.2472.333.camel@edumazet-glaptop
State Superseded, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet April 18, 2012, 7:51 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

While doing netperf sessions on 10Gb Intel NICs (ixgbe), I noticed
unexpected profiling results, with pskb_expand_head() near the top of the
profile.

After further analysis, I found we were hitting page refcounts badly:
when we transmit a full-size skb (64 KB), we can receive an ACK for the
first segments of the frame while the skb has not yet been completely sent
by the NIC.

It takes ~54 us to send a full TSO packet at 10Gb speed, but with a
close peer we can receive a TCP ACK in less than 50 us RTT.
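(Back-of-the-envelope: 65536 bytes * 8 / 10 Gbit/s is ~52 us of payload
time on the wire; adding Ethernet/IP/TCP framing for the ~45 resulting
segments lands in the ~54 us ballpark.)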

This is also true on 1Gb links, but there we were limited by wire speed,
not CPU.

When we try to trim the skb, tcp_trim_head() has to call
pskb_expand_head(), because the skb clone we made for transmit is still
alive in the TX ring buffer.

pskb_expand_head() is really expensive: it has to perform about 16+16
atomic operations on page refcounts, not counting the skb head
reallocation/copy. It also increases the chances of false sharing.

In fact, we don't really need to trim the skb. This costly operation can
be delayed to the point it is really needed: when a retransmit must
happen.
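
For example, if a 64 KB skb gets an ACK covering its first three MSS
(3 * 1448 = 4344 bytes), we now simply record offset_ack = 4344 and
recompute gso_segs over the remaining bytes; the actual trim (and possible
pskb_expand_head()) only happens if this skb ever has to be retransmitted.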

Most of the time, upcoming ACKs will ack the whole packet, and we can then
free it at minimal cost (since the clone was already freed by TX
completion).

Of course, this means we don't uncharge the acked part from the socket
limits until retransmit, but this is hardly a concern with current
autotuning (around 4 MB per socket).
Even with a small cwnd limit, a single packet cannot hold more than half
the window.

Performance results on my Q6600 CPU and 82599EB 10-Gigabit card:
about 3% less CPU used for the same workload (single netperf TCP_STREAM),
bounded by the x4 PCI-e slot (4660 Mbits).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
v2 : added Neal's suggestions

 include/net/tcp.h     |    6 ++++--
 net/ipv4/tcp_input.c  |   22 +++++++++++-----------
 net/ipv4/tcp_output.c |   25 +++++++++++++++++--------
 3 files changed, 32 insertions(+), 21 deletions(-)




Comments

Ilpo Järvinen April 19, 2012, 11:10 a.m. UTC | #1
On Wed, 18 Apr 2012, Eric Dumazet wrote:

> From: Eric Dumazet <edumazet@google.com>
> 
> While doing netperf sessions on 10Gb Intel nics (ixgbe), I noticed
> unexpected profiling results, with pskb_expand_head() being in the top.
> 
> After further analysis, I found we hit badly page refcounts,
> because when we transmit full size skb (64 KB), we can receive ACK for
> the first segments of the frame while skb was not completely sent by
> NIC.
> 
> It takes ~54 us to send a full TSO packet at 10Gb speed, but with a
> close peer, we can receive TCP ACK in less than 50 us rtt.
> 
> This is also true on 1Gb links but we were limited by wire speed, not
> cpu.
> 
> When we try to trim skb, tcp_trim_head() has to call pskb_expand_head(),
> because the skb clone we did for transmit is still alive in TX ring
> buffer.
> 
> pskb_expand_head() is really expensive : It has to make about 16+16
> atomic operations on page refcounts, not counting the skb head
> reallocation/copy. It increases chances of false sharing.
> 
> In fact, we dont really need to trim skb. This costly operation can be
> delayed to the point it is really needed : Thats when a retransmit must
> happen.
> 
> Most of the time, upcoming ACKS will ack the whole packet, and we can
> free it with minimal cost (since clone was already freed by TX
> completion)
> 
> Of course, this means we dont uncharge the acked part from socket limits
> until retransmit, but this is hardly a concern with current autotuning
> (around 4MB per socket)
> Even with small cwnd limit, a single packet can not hold more than half
> the window.
> 
> Performance results on my Q6600 cpu and 82599EB 10-Gigabit card :
> About 3% less cpu used for same workload (single netperf TCP_STREAM),
> bounded by x4 PCI-e slots (4660 Mbits).
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tom Herbert <therbert@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
> v2 : added Neal suggestions
> 
>  include/net/tcp.h     |    6 ++++--
>  net/ipv4/tcp_input.c  |   22 +++++++++++-----------
>  net/ipv4/tcp_output.c |   25 +++++++++++++++++--------
>  3 files changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index d5984e3..0f57706 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -477,7 +477,8 @@ extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
>  extern void tcp_retransmit_timer(struct sock *sk);
>  extern void tcp_xmit_retransmit_queue(struct sock *);
>  extern void tcp_simple_retransmit(struct sock *);
> -extern int tcp_trim_head(struct sock *, struct sk_buff *, u32);
> +extern void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
> +				 unsigned int mss_now);
>  extern int tcp_fragment(struct sock *, struct sk_buff *, u32, unsigned int);
>  
>  extern void tcp_send_probe0(struct sock *);
> @@ -640,7 +641,8 @@ struct tcp_skb_cb {
>  #if IS_ENABLED(CONFIG_IPV6)
>  		struct inet6_skb_parm	h6;
>  #endif
> -	} header;	/* For incoming frames		*/
> +		unsigned int offset_ack; /* part of acked data in this skb */
> +	} header;
>  	__u32		seq;		/* Starting sequence number	*/
>  	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
>  	__u32		when;		/* used to compute rtt's	*/
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 99448f0..bdec2e6 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3260,25 +3260,25 @@ static void tcp_rearm_rto(struct sock *sk)
>  	}
>  }
>  
> -/* If we get here, the whole TSO packet has not been acked. */
> +/* If we get here, the whole packet has not been acked.
> + * We used to call tcp_trim_head() to remove acked data from skb,
> + * but its expensive with TSO if our previous clone is still in flight.
> + * We thus maintain an offset_ack, and hope no pskb_expand_head()
> + * is needed until whole packet is acked by upcoming ACKs.
> + */
>  static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
> -	u32 packets_acked;
> +	u32 oldpcount = tcp_skb_pcount(skb);
>  
>  	BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una));
>  
> -	packets_acked = tcp_skb_pcount(skb);
> -	if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
> -		return 0;
> -	packets_acked -= tcp_skb_pcount(skb);
> +	TCP_SKB_CB(skb)->header.offset_ack = tp->snd_una - TCP_SKB_CB(skb)->seq;

Now that you have a non-zero offset_ack, are the tcp_fragment() callsites
safe and working? ...I'm mostly worried about tcp_mark_head_lost, which
makes some assumptions about tp->snd_una and TCP_SKB_CB(skb)->seq; also,
other fragmenting does not preserve offset_ack properly (which might not be
the end of the world, though)?
Eric Dumazet April 19, 2012, 11:30 a.m. UTC | #2
On Thu, 2012-04-19 at 14:10 +0300, Ilpo Järvinen wrote:

> Now that you have non-zero offset_ack, are the tcp_fragment() callsites 
> safe and working? ...I'm mostly worried about tcp_mark_head_lost which 
> does some assumptions about tp->snd_una and TCP_SKB_CB(skb)->seq, however, 
> also other fragmenting does not preserve offset_ack properly (which might 
> not be end of world though)?

Good point, I'll take a look.

I'll provide a v3 anyway with more performance data; I set up two cards
in PCI x8 slots to get full bandwidth.

Thanks


Eric Dumazet April 19, 2012, 11:40 a.m. UTC | #3
On Thu, 2012-04-19 at 13:30 +0200, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 14:10 +0300, Ilpo Järvinen wrote:
> 
> > Now that you have non-zero offset_ack, are the tcp_fragment() callsites 
> > safe and working? ...I'm mostly worried about tcp_mark_head_lost which 
> > does some assumptions about tp->snd_una and TCP_SKB_CB(skb)->seq, however, 
> > also other fragmenting does not preserve offset_ack properly (which might 
> > not be end of world though)?
> 
> Good point, I'll take a look.

Hmm, the only point where this could matter is if a packet is retransmitted.
For other packets, offset_ack = 0 (the default value on skb allocation).

And tcp_retransmit_skb() first calls tcp_trim_head(sk, skb) if needed, so
tcp_fragment() is called with offset_ack == 0:

        if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
                if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
                        BUG();
                if (tcp_trim_head(sk, skb))
                        return -ENOMEM;
        }

...
if (skb->len > cur_mss) {
	if (tcp_fragment(sk, skb, cur_mss, cur_mss))



I could add a BUG_ON(offset_ack != 0) there to make sure this assertion
holds.
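
Something like this (untested), right after the trim block in
tcp_retransmit_skb(), before we can possibly reach tcp_fragment():

	BUG_ON(TCP_SKB_CB(skb)->header.offset_ack != 0);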

What do you think ?


Ilpo Järvinen April 19, 2012, 11:57 a.m. UTC | #4
On Thu, 19 Apr 2012, Eric Dumazet wrote:

> On Thu, 2012-04-19 at 13:30 +0200, Eric Dumazet wrote:
> > On Thu, 2012-04-19 at 14:10 +0300, Ilpo Järvinen wrote:
> > 
> > > Now that you have non-zero offset_ack, are the tcp_fragment() callsites 
> > > safe and working? ...I'm mostly worried about tcp_mark_head_lost which 
> > > does some assumptions about tp->snd_una and TCP_SKB_CB(skb)->seq, however, 
> > > also other fragmenting does not preserve offset_ack properly (which might 
> > > not be end of world though)?
> > 
> > Good point, I'll take a look.
> 
> Hmm, the only point this could matter is if a packet is retransmitted.
>
> For other packets, offset_ack = 0 (default value on skb allocation)
> 
> And tcp_retransmit_skb() first call tcp_trim_head(sk, skb) if needed so
> tcp_fragment() is called with == 0
> 
>         if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
>                 if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
>                         BUG();
>                 if (tcp_trim_head(sk, skb))
>                         return -ENOMEM;
>         }
> 
> ...
> if (skb->len > cur_mss) {
> 	if (tcp_fragment(sk, skb, cur_mss, cur_mss))
> 
> 
> 
> I could add a BUG_ON(offset_ack == 0) to make sure this assertion is
> true.

If you end up putting in something like that, make sure you use WARN_ON
instead, as this surely isn't fatal enough to warrant a full stop of the
box :-).

> What do you think ?

I'm not concerned about the output side; that seems to work because of the
trim in tcp_retransmit_skb getting rid of the extra data first.

The ACK input stuff is more interesting, e.g., this one in 
tcp_mark_head_lost:

	err = tcp_fragment(sk, skb, (packets - oldcnt) * mss, mss);

It splits at TCP_SKB_CB(skb)->seq + (packets - oldcnt) * mss, whereas
I think the desired point would be TCP_SKB_CB(skb)->seq + offset_ack +
(packets - oldcnt) * mss?

...There is a similar case in the sacktag code too, where it aligns to mss
boundaries in tcp_match_skb_to_sack.
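
(Concretely: with mss = 1448 and offset_ack = 1448, that split point lands
one segment too early, so the chunk marked lost starts in already-acked data
and covers one genuinely unacked segment fewer than intended.)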
Eric Dumazet April 19, 2012, 12:44 p.m. UTC | #5
On Thu, 2012-04-19 at 14:57 +0300, Ilpo Järvinen wrote:

> I'm not concerned of the output side, that seems to work because 
> of the in tcp_retransmit_skb getting rid of the extra first.
> 
> The ACK input stuff is more interesting, e.g., this one in 
> tcp_mark_head_lost:
> 
> 	err = tcp_fragment(sk, skb, (packets - oldcnt) * mss, mss);
> 
> It splits from TCP_SKB_CB(skb)->seq + (packets - oldcnt) * mss whereas
> I think the desired point would be: TCP_SKB_CB(skb)->seq + offset_ack + 
> (packets - oldcnt) * mss?
> 
> ...There is similar case in sacktag code too while it's aligning to mss 
> boundaries in tcp_match_skb_to_sack.

Hmm yes, so maybe it's safer to update TCP_SKB_CB(skb)->seq in
tcp_tso_acked() (as well as offset_ack) and make the needed adjustments in
tcp_fragment() if we find offset_ack is not zero.



Eric Dumazet April 19, 2012, 1:18 p.m. UTC | #6
On Thu, 2012-04-19 at 13:30 +0200, Eric Dumazet wrote:

> I'll provide a v3 anyway with more performance data, I setup two cards
> in PCI x8 slots to get full bandwidth.

Incidentally, using PCI x8 slots no longer triggers the slow path on an
unpatched kernel with a single flow (~9410 Mbits).

It seems we are lucky enough to TX-complete the sent clones before trying
tcp_trim_head() when processing the ACK.

Sounds like a timing issue, and the fact that drivers batch TX completions
and RX completions.

Also, BQL might have changed things a bit here (ixgbe is BQL enabled).

Only if I start several concurrent flows do I see the pskb_expand_head()
overhead.



Eric Dumazet April 19, 2012, 1:52 p.m. UTC | #7
On Thu, 2012-04-19 at 15:18 +0200, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 13:30 +0200, Eric Dumazet wrote:
> 
> > I'll provide a v3 anyway with more performance data, I setup two cards
> > in PCI x8 slots to get full bandwidth.
> 
> Incidentally, using PCI x8 slots dont anymore trigger the slow path on
> unpatched kernel and a single flow (~9410 Mbits)
> 
> It seems we are lucky enough to TX complete sent clones before trying to
> tcp_trim_head() when processing ACK
> 
> Sounds like a timing issue, and fact that drivers batches TX completions
> and RX completions.
> 
> Also BQL might have changed things a bit here (ixgbe is BQL enabled)
> 
> Only if I start several concurrent flows I see the pskb_expand_head()
> overhead.
> 
> 

And disabling GRO on the receiver definitely demonstrates the problem, even
with a single flow (performance drops from 9410 Mbit to 6050 Mbit).
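(i.e., ethtool -K <iface> gro off on the receiving side.)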





Eric Dumazet April 19, 2012, 2:10 p.m. UTC | #8
On Thu, 2012-04-19 at 15:52 +0200, Eric Dumazet wrote:

> And disabling GRO on receiver definitely demonstrates the problem, even
> with a single flow. (and performance drops from 9410 Mbit to 6050 Mbit)

That's insane.

Performance drops so much because we _drop_ incoming ACKs:

<     TCPSackShifted: 39117
<     TCPSackMerged: 16500
<     TCPSackShiftFallback: 5092
<     TCPBacklogDrop: 27965
---
>     TCPSackShifted: 35122
>     TCPSackMerged: 16368
>     TCPSackShiftFallback: 4889
>     TCPBacklogDrop: 23247

Hmm, maybe we should reduce skb->truesize for small packets before
queueing them in the socket backlog...



Rick Jones April 19, 2012, 5:20 p.m. UTC | #9
On 04/19/2012 07:10 AM, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 15:52 +0200, Eric Dumazet wrote:
>
>> And disabling GRO on receiver definitely demonstrates the problem, even
>> with a single flow. (and performance drops from 9410 Mbit to 6050 Mbit)
>
> That insane.
>
> Performance drops so much because we _drop_ incoming ACKS :
>
> <      TCPSackShifted: 39117
> <      TCPSackMerged: 16500
> <      TCPSackShiftFallback: 5092
> <      TCPBacklogDrop: 27965
> ---
>>      TCPSackShifted: 35122
>>      TCPSackMerged: 16368
>>      TCPSackShiftFallback: 4889
>>      TCPBacklogDrop: 23247
>
> Hmm, maybe we should reduce skb->truesize for small packets before
> queueing them in socket backlog...

By copying them to smaller buffers? Or just by altering truesize? 
Wasn't the whole point of fixing all the broken truesize settings to 
accurately account for memory consumed?

rick
Eric Dumazet April 19, 2012, 5:25 p.m. UTC | #10
On Thu, 2012-04-19 at 10:20 -0700, Rick Jones wrote:

> By copying them to smaller buffers? Or just by altering truesize? 
> Wasn't the whole point of fixing all the broken truesize settings to 
> accurately account for memory consumed?

I checked, their truesize is OK (1024+256) for the ixgbe driver.
It could be a little smaller, but not by much (512 + 256).

No, it's just that sk_rcvbuf is small for a TCP sender,
and sk_add_backlog() makes sure we don't queue more than sk_rcvbuf bytes
in the backlog.





Rick Jones April 19, 2012, 5:48 p.m. UTC | #11
On 04/19/2012 10:25 AM, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 10:20 -0700, Rick Jones wrote:
>
>> By copying them to smaller buffers? Or just by altering truesize?
>> Wasn't the whole point of fixing all the broken truesize settings to
>> accurately account for memory consumed?
>
> I checked, their truesize are OK (1024+256) for ixgbe driver.
> They could be little smaller, but not that much. (512 + 256)
>
> No, its only the sk_rcvbuf is small for a tcp sender,
> and sk_add_backlog() makes sure we dont queue more than sk_rcvbuf
> bytes in backlog.

Sounds like a variation on the theme of wildly divergent inbound/outbound
bandwidth, where a constrained ACK path constrains throughput - only with
buffer sizes.

87380 is the default SO_RCVBUF, right?  That should have allowed
87380/1280 or 68 ACKs to be queued (1280 being the 1024+256 truesize per
ACK).  Without ACK stretching from GRO, that should have covered 68 * 2896
or 196928 bytes (one ACK for every two 1448-byte segments).  Based on your
previous 54 usec to transmit 64 KB, that would be 162+ usecs to accumulate
those ACKs, so I guess a question becomes whether TCP can be held off
processing ACKs for > 162 usecs, and if so and that cannot be changed, the
autotuning will have to start increasing SO_RCVBUF alongside SO_SNDBUF even
if the endpoint is not receiving.  As a handwave, since TCP does not know
the buffer size(s) used by the driver, by 1 MSS for every 2 MSS it adds to
SO_SNDBUF.  Or find some way to do it "on demand" in the about-to-drop path.

That, or bare ACKs have to be excluded from the overhead checks somehow, I
guess, or there has to be more aggressive copying into smaller buffers.

Thankfully, when applications make explicit setsockopt() calls, they 
tend (ok, perhaps that is a bit of a guess) to set both SO_SNDBUF and 
SO_RCVBUF at the same time.

rick
Eric Dumazet April 19, 2012, 6 p.m. UTC | #12
On Thu, 2012-04-19 at 10:48 -0700, Rick Jones wrote:
> On 04/19/2012 10:25 AM, Eric Dumazet wrote:
> > On Thu, 2012-04-19 at 10:20 -0700, Rick Jones wrote:
> >
> >> By copying them to smaller buffers? Or just by altering truesize?
> >> Wasn't the whole point of fixing all the broken truesize settings to
> >> accurately account for memory consumed?
> >
> > I checked, their truesize are OK (1024+256) for ixgbe driver.
> > They could be little smaller, but not that much. (512 + 256)
> >
> > No, its only the sk_rcvbuf is small for a tcp sender,
> > and sk_add_backlog() makes sure we dont queue more than sk_rcvbuf
> > bytes in backlog.
> 
> Sounds like a variation on the theme of wildly divergent 
> inbound/outbound bandwidth and constraining ACKs constraining throughput 
> - only with buffer sizes.
> 
> 87380 is the default SO_RCVBUF right?  That should have allowed 
> 87380/1280 or 68 ACKs to be queued.  Without ACK stretching from GRO 
> that should have covered 68 * 2896 or 196928 bytes.  Based on your 
> previous 54 usec to transmit 64 KB that would be 162+ usecs to 
> accumulate those ACKs, so I guess a question becomes if TCP can be 
> held-off processing ACKs for > 162 usecs, and if so and that cannot be 
> changed, the autotuning will have to start increasing SO_SNDBUF 
> alongside so_sndbuf even if the endpoint is not receiving.  As a 
> handwave, since TCP does not know the buffer size(s) used by the driver, 
> by 1 MSS for every 2 MSS it adds to SO_SNDBUF.  Or find some way to do 
> it "on demand" in the about to drop path.
> 
> That or bare ACKs have to be excluded from the overhead checks somehow I 
> guess, or there be more aggressive copying into smaller buffers.
> 

Nope, we need a limit, or we risk OOM if a malicious peer sends an ACK flood ;)

To be clear, if I change tcp_rmem[1] from 87380 to a big value, I no
longer have ACK drops, but performance is still poor; I am still
investigating.
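(The tweak being e.g. sysctl -w net.ipv4.tcp_rmem="4096 4194304 6291456";
the exact values here are only illustrative.)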


Rick Jones April 19, 2012, 6:05 p.m. UTC | #13
>> That or bare ACKs have to be excluded from the overhead checks somehow I
>> guess, or there be more aggressive copying into smaller buffers.
>>
>
> Nope, we need a limit or risk OOM if a malicious peer send ACK flood ;)

Well, there is that...

>
> To be clear, if I change the tcp_rmem[1] from 87380 to big value, I no
> longer have ACK drops, but still poor performance, I am still
> investigating.

What happens if you set net.core.[rw]mem_max to 4 MB and then use 
something like -s 1M -S 1M in netperf?
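(e.g. sysctl -w net.core.rmem_max=4194304 and net.core.wmem_max=4194304,
so the 1M setsockopt() values are not clamped.)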

netperf -t TCP_STREAM -H <remote> -- -s 1M -S 1M -m 64K

(or -m 16K if you want to keep the send size the same as your previous 
tests...)

rick
Ilpo Järvinen April 20, 2012, 12:27 p.m. UTC | #14
On Thu, 19 Apr 2012, Eric Dumazet wrote:

> On Thu, 2012-04-19 at 14:57 +0300, Ilpo Järvinen wrote:
> 
> > I'm not concerned of the output side, that seems to work because 
> > of the in tcp_retransmit_skb getting rid of the extra first.
> > 
> > The ACK input stuff is more interesting, e.g., this one in 
> > tcp_mark_head_lost:
> > 
> > 	err = tcp_fragment(sk, skb, (packets - oldcnt) * mss, mss);
> > 
> > It splits from TCP_SKB_CB(skb)->seq + (packets - oldcnt) * mss whereas
> > I think the desired point would be: TCP_SKB_CB(skb)->seq + offset_ack + 
> > (packets - oldcnt) * mss?
> > 
> > ...There is similar case in sacktag code too while it's aligning to mss 
> > boundaries in tcp_match_skb_to_sack.
> 
> Hmm yes, so maybe its safer to update TCP_SKB_CB(skb)->seq in
> tcp_tso_acked() (as well as offset_ack) and make needed adjustements in
> tcp_fragment() if we find offset_ack being not null.

I suppose that somewhat works; it helps a lot here that you work only with
the head skb, which makes a lot of cases impossible... I once made something
similar (before I came up with the current shift/merge approach) and ended
up with this:
  http://www.mail-archive.com/netdev@vger.kernel.org/msg56191.html

...But...

There's still another can of worms, it seems... at least tcp_skb_seglen(),
which returns weird results when pcount becomes 1 (it falls back to
skb->len, and with a non-zero offset_ack that length still includes the
already-acked bytes).

...Somewhat related to the pcount == 1 problem, I've long wondered if the
zeroed gso_size with pcount == 1 is worth keeping in the first place?

However, I kind of liked the neatness of the original approach, where ->seq
does not lie. That would have kept most of the stuff very localized, because
each skb is still fully valid and consistent with itself, whereas
introducing lies adds lots of hidden traps (except for the pcount of course,
but consistency for it has not been there for some years already :-)). The
tcp_match_skb_to_sack code seems to actually work exactly because of this
consistency (if I got it right on the second/third read).

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d5984e3..0f57706 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -477,7 +477,8 @@  extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
 extern void tcp_retransmit_timer(struct sock *sk);
 extern void tcp_xmit_retransmit_queue(struct sock *);
 extern void tcp_simple_retransmit(struct sock *);
-extern int tcp_trim_head(struct sock *, struct sk_buff *, u32);
+extern void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+				 unsigned int mss_now);
 extern int tcp_fragment(struct sock *, struct sk_buff *, u32, unsigned int);
 
 extern void tcp_send_probe0(struct sock *);
@@ -640,7 +641,8 @@  struct tcp_skb_cb {
 #if IS_ENABLED(CONFIG_IPV6)
 		struct inet6_skb_parm	h6;
 #endif
-	} header;	/* For incoming frames		*/
+		unsigned int offset_ack; /* part of acked data in this skb */
+	} header;
 	__u32		seq;		/* Starting sequence number	*/
 	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
 	__u32		when;		/* used to compute rtt's	*/
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 99448f0..bdec2e6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3260,25 +3260,25 @@  static void tcp_rearm_rto(struct sock *sk)
 	}
 }
 
-/* If we get here, the whole TSO packet has not been acked. */
+/* If we get here, the whole packet has not been acked.
+ * We used to call tcp_trim_head() to remove acked data from skb,
+ * but it's expensive with TSO if our previous clone is still in flight.
+ * We thus maintain an offset_ack, and hope no pskb_expand_head()
+ * is needed until whole packet is acked by upcoming ACKs.
+ */
 static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 packets_acked;
+	u32 oldpcount = tcp_skb_pcount(skb);
 
 	BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una));
 
-	packets_acked = tcp_skb_pcount(skb);
-	if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
-		return 0;
-	packets_acked -= tcp_skb_pcount(skb);
+	TCP_SKB_CB(skb)->header.offset_ack = tp->snd_una - TCP_SKB_CB(skb)->seq;
 
-	if (packets_acked) {
-		BUG_ON(tcp_skb_pcount(skb) == 0);
-		BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
-	}
+	if (oldpcount > 1)
+		tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
 
-	return packets_acked;
+	return oldpcount - tcp_skb_pcount(skb);
 }
 
 /* Remove acknowledged frames from the retransmission queue. If our packet
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index de8790c..f66df37 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -927,11 +927,15 @@  static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
 	sk_mem_charge(sk, skb->truesize);
 }
 
-/* Initialize TSO segments for a packet. */
-static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-				 unsigned int mss_now)
+/* Initialize TSO segments for a packet.
+ * Part of skb (offset_ack) might have been acked.
+ */
+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+			  unsigned int mss_now)
 {
-	if (skb->len <= mss_now || !sk_can_gso(sk) ||
+	unsigned int len = skb->len - TCP_SKB_CB(skb)->header.offset_ack;
+
+	if (len <= mss_now || !sk_can_gso(sk) ||
 	    skb->ip_summed == CHECKSUM_NONE) {
 		/* Avoid the costly divide in the normal
 		 * non-TSO case.
@@ -940,7 +944,7 @@  static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
 		skb_shinfo(skb)->gso_size = 0;
 		skb_shinfo(skb)->gso_type = 0;
 	} else {
-		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now);
+		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(len, mss_now);
 		skb_shinfo(skb)->gso_size = mss_now;
 		skb_shinfo(skb)->gso_type = sk->sk_gso_type;
 	}
@@ -1125,15 +1129,20 @@  static void __pskb_trim_head(struct sk_buff *skb, int len)
 	skb->len = skb->data_len;
 }
 
-/* Remove acked data from a packet in the transmit queue. */
-int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+/* Remove acked data from a packet in the transmit queue.
+ * Only called before retransmit.
+ */
+static int tcp_trim_head(struct sock *sk, struct sk_buff *skb)
 {
+	u32 len = tcp_sk(sk)->snd_una - TCP_SKB_CB(skb)->seq;
+
 	if (skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
 		return -ENOMEM;
 
 	__pskb_trim_head(skb, len);
 
 	TCP_SKB_CB(skb)->seq += len;
+	TCP_SKB_CB(skb)->header.offset_ack = 0;
 	skb->ip_summed = CHECKSUM_PARTIAL;
 
 	skb->truesize	     -= len;
@@ -2096,7 +2105,7 @@  int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
 		if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
 			BUG();
-		if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
+		if (tcp_trim_head(sk, skb))
 			return -ENOMEM;
 	}