Message ID: 1334778707.2472.333.camel@edumazet-glaptop
State: Superseded, archived
Delegated to: David Miller
On Wed, 18 Apr 2012, Eric Dumazet wrote:

> From: Eric Dumazet <edumazet@google.com>
>
> While doing netperf sessions on 10Gb Intel nics (ixgbe), I noticed
> unexpected profiling results, with pskb_expand_head() being in the top.
>
> After further analysis, I found we hit badly page refcounts,
> because when we transmit full size skb (64 KB), we can receive ACK for
> the first segments of the frame while skb was not completely sent by
> NIC.
>
> It takes ~54 us to send a full TSO packet at 10Gb speed, but with a
> close peer, we can receive TCP ACK in less than 50 us rtt.
>
> This is also true on 1Gb links but we were limited by wire speed, not
> cpu.
>
> When we try to trim skb, tcp_trim_head() has to call pskb_expand_head(),
> because the skb clone we did for transmit is still alive in TX ring
> buffer.
>
> pskb_expand_head() is really expensive : It has to make about 16+16
> atomic operations on page refcounts, not counting the skb head
> reallocation/copy. It increases chances of false sharing.
>
> In fact, we dont really need to trim skb. This costly operation can be
> delayed to the point it is really needed : Thats when a retransmit must
> happen.
>
> Most of the time, upcoming ACKS will ack the whole packet, and we can
> free it with minimal cost (since clone was already freed by TX
> completion)
>
> Of course, this means we dont uncharge the acked part from socket limits
> until retransmit, but this is hardly a concern with current autotuning
> (around 4MB per socket)
> Even with small cwnd limit, a single packet can not hold more than half
> the window.
>
> Performance results on my Q6600 cpu and 82599EB 10-Gigabit card :
> About 3% less cpu used for same workload (single netperf TCP_STREAM),
> bounded by x4 PCI-e slots (4660 Mbits).
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tom Herbert <therbert@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
> v2 : added Neal suggestions
>
>  include/net/tcp.h     |    6 ++++--
>  net/ipv4/tcp_input.c  |   22 +++++++++++-----------
>  net/ipv4/tcp_output.c |   25 +++++++++++++++++--------
>  3 files changed, 32 insertions(+), 21 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index d5984e3..0f57706 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -477,7 +477,8 @@ extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
>  extern void tcp_retransmit_timer(struct sock *sk);
>  extern void tcp_xmit_retransmit_queue(struct sock *);
>  extern void tcp_simple_retransmit(struct sock *);
> -extern int tcp_trim_head(struct sock *, struct sk_buff *, u32);
> +extern void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
> +				 unsigned int mss_now);
>  extern int tcp_fragment(struct sock *, struct sk_buff *, u32, unsigned int);
>
>  extern void tcp_send_probe0(struct sock *);
> @@ -640,7 +641,8 @@ struct tcp_skb_cb {
>  #if IS_ENABLED(CONFIG_IPV6)
>  		struct inet6_skb_parm	h6;
>  #endif
> -	} header;	/* For incoming frames */
> +		unsigned int offset_ack; /* part of acked data in this skb */
> +	} header;
>  	__u32		seq;	/* Starting sequence number	*/
>  	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
>  	__u32		when;	/* used to compute rtt's	*/
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 99448f0..bdec2e6 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3260,25 +3260,25 @@ static void tcp_rearm_rto(struct sock *sk)
>  	}
>  }
>
> -/* If we get here, the whole TSO packet has not been acked. */
> +/* If we get here, the whole packet has not been acked.
> + * We used to call tcp_trim_head() to remove acked data from skb,
> + * but its expensive with TSO if our previous clone is still in flight.
> + * We thus maintain an offset_ack, and hope no pskb_expand_head()
> + * is needed until whole packet is acked by upcoming ACKs.
> + */
>  static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
> -	u32 packets_acked;
> +	u32 oldpcount = tcp_skb_pcount(skb);
>
>  	BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una));
>
> -	packets_acked = tcp_skb_pcount(skb);
> -	if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
> -		return 0;
> -	packets_acked -= tcp_skb_pcount(skb);
> +	TCP_SKB_CB(skb)->header.offset_ack = tp->snd_una - TCP_SKB_CB(skb)->seq;

Now that you have non-zero offset_ack, are the tcp_fragment() callsites
safe and working? ...I'm mostly worried about tcp_mark_head_lost which
does some assumptions about tp->snd_una and TCP_SKB_CB(skb)->seq, however,
also other fragmenting does not preserve offset_ack properly (which might
not be end of world though)?
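[Editorial note: the segment accounting the patch relies on can be sketched in plain C. This is a hypothetical userspace model, not kernel code; `div_round_up` mirrors the kernel's DIV_ROUND_UP, and `tso_segs_after_ack` models what the patched tcp_set_skb_tso_segs() computes once the acked prefix (offset_ack) is excluded.]

```c
#include <assert.h>

/* Mirror of the kernel's DIV_ROUND_UP macro. */
static unsigned int div_round_up(unsigned int n, unsigned int d)
{
	return (n + d - 1) / d;
}

/* Model of gso_segs after the patch: the acked prefix of the skb
 * (offset_ack bytes) no longer counts toward the segment count,
 * so pcount shrinks without any trim or pskb_expand_head(). */
static unsigned int tso_segs_after_ack(unsigned int skb_len,
				       unsigned int offset_ack,
				       unsigned int mss)
{
	unsigned int len = skb_len - offset_ack;

	if (len <= mss)		/* non-TSO case, avoid the divide */
		return 1;
	return div_round_up(len, mss);
}
```

For example, a 45-segment skb (45 * 1448 = 65160 bytes) whose first two segments are acked drops from pcount 45 to 43 purely by bookkeeping.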
On Thu, 2012-04-19 at 14:10 +0300, Ilpo Järvinen wrote:

> Now that you have non-zero offset_ack, are the tcp_fragment() callsites
> safe and working? ...I'm mostly worried about tcp_mark_head_lost which
> does some assumptions about tp->snd_una and TCP_SKB_CB(skb)->seq, however,
> also other fragmenting does not preserve offset_ack properly (which might
> not be end of world though)?

Good point, I'll take a look.

I'll provide a v3 anyway with more performance data, I setup two cards
in PCI x8 slots to get full bandwidth.

Thanks

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2012-04-19 at 13:30 +0200, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 14:10 +0300, Ilpo Järvinen wrote:
>
> > Now that you have non-zero offset_ack, are the tcp_fragment() callsites
> > safe and working? ...I'm mostly worried about tcp_mark_head_lost which
> > does some assumptions about tp->snd_una and TCP_SKB_CB(skb)->seq, however,
> > also other fragmenting does not preserve offset_ack properly (which might
> > not be end of world though)?
>
> Good point, I'll take a look.

Hmm, the only point this could matter is if a packet is retransmitted.

For other packets, offset_ack = 0 (default value on skb allocation)

And tcp_retransmit_skb() first call tcp_trim_head(sk, skb) if needed so
tcp_fragment() is called with == 0

	if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
		if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
			BUG();
		if (tcp_trim_head(sk, skb))
			return -ENOMEM;
	}

	...
	if (skb->len > cur_mss) {
		if (tcp_fragment(sk, skb, cur_mss, cur_mss))

I could add a BUG_ON(offset_ack == 0) to make sure this assertion is
true.

What do you think ?
On Thu, 19 Apr 2012, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 13:30 +0200, Eric Dumazet wrote:
> > On Thu, 2012-04-19 at 14:10 +0300, Ilpo Järvinen wrote:
> >
> > > Now that you have non-zero offset_ack, are the tcp_fragment() callsites
> > > safe and working? ...I'm mostly worried about tcp_mark_head_lost which
> > > does some assumptions about tp->snd_una and TCP_SKB_CB(skb)->seq, however,
> > > also other fragmenting does not preserve offset_ack properly (which might
> > > not be end of world though)?
> >
> > Good point, I'll take a look.
>
> Hmm, the only point this could matter is if a packet is retransmitted.
>
> For other packets, offset_ack = 0 (default value on skb allocation)
>
> And tcp_retransmit_skb() first call tcp_trim_head(sk, skb) if needed so
> tcp_fragment() is called with == 0
>
> 	if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
> 		if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
> 			BUG();
> 		if (tcp_trim_head(sk, skb))
> 			return -ENOMEM;
> 	}
>
> 	...
> 	if (skb->len > cur_mss) {
> 		if (tcp_fragment(sk, skb, cur_mss, cur_mss))
>
> I could add a BUG_ON(offset_ack == 0) to make sure this assertion is
> true.

If you end up putting something like that make sure you use WARN_ON
instead as this surely isn't fatal enough to warrant full stop of the
box :-).

> What do you think ?

I'm not concerned of the output side, that seems to work because
of the in tcp_retransmit_skb getting rid of the extra first.

The ACK input stuff is more interesting, e.g., this one in
tcp_mark_head_lost:

	err = tcp_fragment(sk, skb, (packets - oldcnt) * mss, mss);

It splits from TCP_SKB_CB(skb)->seq + (packets - oldcnt) * mss whereas
I think the desired point would be: TCP_SKB_CB(skb)->seq + offset_ack +
(packets - oldcnt) * mss?

...There is similar case in sacktag code too while it's aligning to mss
boundaries in tcp_match_skb_to_sack.
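[Editorial note: Ilpo's split-point concern can be made concrete with a small sketch (hypothetical helper names, userspace model): once an acked prefix is left inside the head skb, a split computed from ->seq lands offset_ack bytes too early.]

```c
#include <assert.h>
typedef unsigned int u32;

/* Where tcp_mark_head_lost's tcp_fragment() call splits today:
 * counted from the skb's (stale) ->seq. */
static u32 split_seq_current(u32 seq, u32 packets, u32 mss)
{
	return seq + packets * mss;
}

/* Where it should split once offset_ack bytes of the skb are
 * already acked, per Ilpo's suggestion. */
static u32 split_seq_with_offset(u32 seq, u32 offset_ack,
				 u32 packets, u32 mss)
{
	return seq + offset_ack + packets * mss;
}
```

With seq = 1000, two acked segments (offset_ack = 2896) and two further segments to mark lost, the current code splits at 3896 while the desired point is 6792; the two only agree when offset_ack is zero.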
On Thu, 2012-04-19 at 14:57 +0300, Ilpo Järvinen wrote:

> I'm not concerned of the output side, that seems to work because
> of the in tcp_retransmit_skb getting rid of the extra first.
>
> The ACK input stuff is more interesting, e.g., this one in
> tcp_mark_head_lost:
>
> 	err = tcp_fragment(sk, skb, (packets - oldcnt) * mss, mss);
>
> It splits from TCP_SKB_CB(skb)->seq + (packets - oldcnt) * mss whereas
> I think the desired point would be: TCP_SKB_CB(skb)->seq + offset_ack +
> (packets - oldcnt) * mss?
>
> ...There is similar case in sacktag code too while it's aligning to mss
> boundaries in tcp_match_skb_to_sack.

Hmm yes, so maybe its safer to update TCP_SKB_CB(skb)->seq in
tcp_tso_acked() (as well as offset_ack) and make needed adjustements in
tcp_fragment() if we find offset_ack being not null.
On Thu, 2012-04-19 at 13:30 +0200, Eric Dumazet wrote:

> I'll provide a v3 anyway with more performance data, I setup two cards
> in PCI x8 slots to get full bandwidth.

Incidentally, using PCI x8 slots dont anymore trigger the slow path on
unpatched kernel and a single flow (~9410 Mbits)

It seems we are lucky enough to TX complete sent clones before trying to
tcp_trim_head() when processing ACK

Sounds like a timing issue, and fact that drivers batches TX completions
and RX completions.

Also BQL might have changed things a bit here (ixgbe is BQL enabled)

Only if I start several concurrent flows I see the pskb_expand_head()
overhead.
On Thu, 2012-04-19 at 15:18 +0200, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 13:30 +0200, Eric Dumazet wrote:
>
> > I'll provide a v3 anyway with more performance data, I setup two cards
> > in PCI x8 slots to get full bandwidth.
>
> Incidentally, using PCI x8 slots dont anymore trigger the slow path on
> unpatched kernel and a single flow (~9410 Mbits)
>
> It seems we are lucky enough to TX complete sent clones before trying to
> tcp_trim_head() when processing ACK
>
> Sounds like a timing issue, and fact that drivers batches TX completions
> and RX completions.
>
> Also BQL might have changed things a bit here (ixgbe is BQL enabled)
>
> Only if I start several concurrent flows I see the pskb_expand_head()
> overhead.
>

And disabling GRO on receiver definitely demonstrates the problem, even
with a single flow. (and performance drops from 9410 Mbit to 6050 Mbit)
On Thu, 2012-04-19 at 15:52 +0200, Eric Dumazet wrote:

> And disabling GRO on receiver definitely demonstrates the problem, even
> with a single flow. (and performance drops from 9410 Mbit to 6050 Mbit)

That insane.

Performance drops so much because we _drop_ incoming ACKS :

< TCPSackShifted: 39117
< TCPSackMerged: 16500
< TCPSackShiftFallback: 5092
< TCPBacklogDrop: 27965
---
> TCPSackShifted: 35122
> TCPSackMerged: 16368
> TCPSackShiftFallback: 4889
> TCPBacklogDrop: 23247

Hmm, maybe we should reduce skb->truesize for small packets before
queueing them in socket backlog...
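[Editorial note: the TCPBacklogDrop counts above can be modeled roughly as follows. This is a simplified sketch of sk_add_backlog()'s limit check with hypothetical names, assuming the backlog never drains while the socket is owned by the user, a per-ACK truesize of 1280 bytes (1024 + 256, as measured later in the thread) and the default tcp_rmem[1] of 87380.]

```c
#include <assert.h>

/* Count how many of n_acks incoming ACKs overflow the socket
 * backlog, given that sk_add_backlog() refuses to queue once the
 * backlog holds more than rcvbuf bytes of truesize. */
static unsigned int backlog_drops(unsigned int n_acks,
				  unsigned int truesize,
				  unsigned int rcvbuf)
{
	unsigned int backlog_len = 0, drops = 0, i;

	for (i = 0; i < n_acks; i++) {
		if (backlog_len + truesize > rcvbuf)
			drops++;	/* counted as TCPBacklogDrop */
		else
			backlog_len += truesize;
	}
	return drops;
}
```

Only 68 such ACKs fit before every further ACK is dropped, which matches the tens of thousands of TCPBacklogDrop events per netperf run above.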
On 04/19/2012 07:10 AM, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 15:52 +0200, Eric Dumazet wrote:
>
>> And disabling GRO on receiver definitely demonstrates the problem, even
>> with a single flow. (and performance drops from 9410 Mbit to 6050 Mbit)
>
> That insane.
>
> Performance drops so much because we _drop_ incoming ACKS :
>
> < TCPSackShifted: 39117
> < TCPSackMerged: 16500
> < TCPSackShiftFallback: 5092
> < TCPBacklogDrop: 27965
> ---
>> TCPSackShifted: 35122
>> TCPSackMerged: 16368
>> TCPSackShiftFallback: 4889
>> TCPBacklogDrop: 23247
>
> Hmm, maybe we should reduce skb->truesize for small packets before
> queueing them in socket backlog...

By copying them to smaller buffers?  Or just by altering truesize?
Wasn't the whole point of fixing all the broken truesize settings to
accurately account for memory consumed?

rick
On Thu, 2012-04-19 at 10:20 -0700, Rick Jones wrote:

> By copying them to smaller buffers?  Or just by altering truesize?
> Wasn't the whole point of fixing all the broken truesize settings to
> accurately account for memory consumed?

I checked, their truesize are OK (1024+256) for ixgbe driver.
They could be little smaller, but not that much. (512 + 256)

No, its only the sk_rcvbuf is small for a tcp sender,
and sk_add_backlog() makes sure we dont queue more than sk_rcvbuf
bytes in backlog.
On 04/19/2012 10:25 AM, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 10:20 -0700, Rick Jones wrote:
>
>> By copying them to smaller buffers?  Or just by altering truesize?
>> Wasn't the whole point of fixing all the broken truesize settings to
>> accurately account for memory consumed?
>
> I checked, their truesize are OK (1024+256) for ixgbe driver.
> They could be little smaller, but not that much. (512 + 256)
>
> No, its only the sk_rcvbuf is small for a tcp sender,
> and sk_add_backlog() makes sure we dont queue more than sk_rcvbuf
> bytes in backlog.

Sounds like a variation on the theme of wildly divergent
inbound/outbound bandwidth and constraining ACKs constraining
throughput - only with buffer sizes.

87380 is the default SO_RCVBUF right?  That should have allowed
87380/1280 or 68 ACKs to be queued.  Without ACK stretching from GRO
that should have covered 68 * 2896 or 196928 bytes.  Based on your
previous 54 usec to transmit 64 KB that would be 162+ usecs to
accumulate those ACKs, so I guess a question becomes if TCP can be
held-off processing ACKs for > 162 usecs, and if so and that cannot be
changed, the autotuning will have to start increasing SO_RCVBUF
alongside SO_SNDBUF even if the endpoint is not receiving.  As a
handwave, since TCP does not know the buffer size(s) used by the
driver, by 1 MSS for every 2 MSS it adds to SO_SNDBUF.  Or find some
way to do it "on demand" in the about to drop path.

That or bare ACKs have to be excluded from the overhead checks somehow
I guess, or there be more aggressive copying into smaller buffers.

Thankfully, when applications make explicit setsockopt() calls, they
tend (ok, perhaps that is a bit of a guess) to set both SO_SNDBUF and
SO_RCVBUF at the same time.

rick
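[Editorial note: Rick's figures can be checked mechanically (hypothetical helper names; assumptions: 1280 = 1024 + 256 bytes of truesize per ACK as Eric measured, one ACK acknowledging two 1448-byte segments when GRO is off).]

```c
#include <assert.h>

/* ACKs that fit in the default receive buffer before the backlog
 * starts dropping them. */
static unsigned int acks_in_rcvbuf(unsigned int rcvbuf,
				   unsigned int ack_truesize)
{
	return rcvbuf / ack_truesize;
}

/* Payload bytes those ACKs cumulatively acknowledge, with one ACK
 * per two 1448-byte segments (no stretch-ACKs from GRO). */
static unsigned int bytes_covered(unsigned int acks)
{
	return acks * 2 * 1448;
}
```

87380 / 1280 gives 68 queued ACKs covering 196928 bytes, about three 64 KB TSO packets' worth, hence the ~162 usec figure at 54 usec per full TSO packet.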
On Thu, 2012-04-19 at 10:48 -0700, Rick Jones wrote:
> Sounds like a variation on the theme of wildly divergent
> inbound/outbound bandwidth and constraining ACKs constraining
> throughput - only with buffer sizes.
>
> 87380 is the default SO_RCVBUF right?  That should have allowed
> 87380/1280 or 68 ACKs to be queued.  Without ACK stretching from GRO
> that should have covered 68 * 2896 or 196928 bytes.  Based on your
> previous 54 usec to transmit 64 KB that would be 162+ usecs to
> accumulate those ACKs, so I guess a question becomes if TCP can be
> held-off processing ACKs for > 162 usecs, and if so and that cannot be
> changed, the autotuning will have to start increasing SO_RCVBUF
> alongside SO_SNDBUF even if the endpoint is not receiving.  As a
> handwave, since TCP does not know the buffer size(s) used by the
> driver, by 1 MSS for every 2 MSS it adds to SO_SNDBUF.  Or find some
> way to do it "on demand" in the about to drop path.
>
> That or bare ACKs have to be excluded from the overhead checks somehow
> I guess, or there be more aggressive copying into smaller buffers.
>

Nope, we need a limit or risk OOM if a malicious peer send ACK flood ;)

To be clear, if I change the tcp_rmem[1] from 87380 to big value, I no
longer have ACK drops, but still poor performance, I am still
investigating.
>> That or bare ACKs have to be excluded from the overhead checks somehow
>> I guess, or there be more aggressive copying into smaller buffers.
>>
>
> Nope, we need a limit or risk OOM if a malicious peer send ACK flood ;)

Well, there is that...

> To be clear, if I change the tcp_rmem[1] from 87380 to big value, I no
> longer have ACK drops, but still poor performance, I am still
> investigating.

What happens if you set net.core.[rw]mem_max to 4 MB and then use
something like -s 1M -S 1M in netperf?

netperf -t TCP_STREAM -H <remote> -- -s 1M -S 1M -m 64K

(or -m 16K if you want to keep the send size the same as your previous
tests...)

rick
On Thu, 19 Apr 2012, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 14:57 +0300, Ilpo Järvinen wrote:
>
> > I'm not concerned of the output side, that seems to work because
> > of the in tcp_retransmit_skb getting rid of the extra first.
> >
> > The ACK input stuff is more interesting, e.g., this one in
> > tcp_mark_head_lost:
> >
> > 	err = tcp_fragment(sk, skb, (packets - oldcnt) * mss, mss);
> >
> > It splits from TCP_SKB_CB(skb)->seq + (packets - oldcnt) * mss whereas
> > I think the desired point would be: TCP_SKB_CB(skb)->seq + offset_ack +
> > (packets - oldcnt) * mss?
> >
> > ...There is similar case in sacktag code too while it's aligning to mss
> > boundaries in tcp_match_skb_to_sack.
>
> Hmm yes, so maybe its safer to update TCP_SKB_CB(skb)->seq in
> tcp_tso_acked() (as well as offset_ack) and make needed adjustements in
> tcp_fragment() if we find offset_ack being not null.

I suppose that somewhat works, it helps here a lot that you work only
with the head skb making lot of cases not possible... I once made
something similar (before I came up the current shift/merge approach)
and ended up to this:

http://www.mail-archive.com/netdev@vger.kernel.org/msg56191.html

...But... There's another can of worms still it seems.... At least
tcp_skb_seglen that returns weird results when pcount becomes 1!
...Somewhat related to the pcount == 1 problem, I've long wondered if
the zeroed gso_size with pcount == 1 is worth keeping in the first
place?

However, I kind of liked the neatness in the original approach where
->seq does not lie. That would have kept most of stuff very localized
because each skb is still fully valid and consistent with itself,
whereas introducing lies adds lots of hidden traps (except for the
pcount of course, but consistency for it has not been there for some
years already :-)). The tcp_match_skb_to_sack code seems to actually
work exactly because of this consistency (if I now on the second/third
read got it right).
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d5984e3..0f57706 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -477,7 +477,8 @@ extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
 extern void tcp_retransmit_timer(struct sock *sk);
 extern void tcp_xmit_retransmit_queue(struct sock *);
 extern void tcp_simple_retransmit(struct sock *);
-extern int tcp_trim_head(struct sock *, struct sk_buff *, u32);
+extern void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+				 unsigned int mss_now);
 extern int tcp_fragment(struct sock *, struct sk_buff *, u32, unsigned int);
 
 extern void tcp_send_probe0(struct sock *);
@@ -640,7 +641,8 @@ struct tcp_skb_cb {
 #if IS_ENABLED(CONFIG_IPV6)
 		struct inet6_skb_parm	h6;
 #endif
-	} header;	/* For incoming frames */
+		unsigned int offset_ack; /* part of acked data in this skb */
+	} header;
 	__u32		seq;	/* Starting sequence number	*/
 	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
 	__u32		when;	/* used to compute rtt's	*/
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 99448f0..bdec2e6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3260,25 +3260,25 @@ static void tcp_rearm_rto(struct sock *sk)
 	}
 }
 
-/* If we get here, the whole TSO packet has not been acked. */
+/* If we get here, the whole packet has not been acked.
+ * We used to call tcp_trim_head() to remove acked data from skb,
+ * but its expensive with TSO if our previous clone is still in flight.
+ * We thus maintain an offset_ack, and hope no pskb_expand_head()
+ * is needed until whole packet is acked by upcoming ACKs.
+ */
 static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 packets_acked;
+	u32 oldpcount = tcp_skb_pcount(skb);
 
 	BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una));
 
-	packets_acked = tcp_skb_pcount(skb);
-	if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
-		return 0;
-	packets_acked -= tcp_skb_pcount(skb);
+	TCP_SKB_CB(skb)->header.offset_ack = tp->snd_una - TCP_SKB_CB(skb)->seq;
 
-	if (packets_acked) {
-		BUG_ON(tcp_skb_pcount(skb) == 0);
-		BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
-	}
+	if (oldpcount > 1)
+		tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
 
-	return packets_acked;
+	return oldpcount - tcp_skb_pcount(skb);
 }
 
 /* Remove acknowledged frames from the retransmission queue. If our packet
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index de8790c..f66df37 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -927,11 +927,15 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
 	sk_mem_charge(sk, skb->truesize);
 }
 
-/* Initialize TSO segments for a packet. */
-static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-				 unsigned int mss_now)
+/* Initialize TSO segments for a packet.
+ * Part of skb (offset_ack) might have been acked.
+ */
+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+			  unsigned int mss_now)
 {
-	if (skb->len <= mss_now || !sk_can_gso(sk) ||
+	unsigned int len = skb->len - TCP_SKB_CB(skb)->header.offset_ack;
+
+	if (len <= mss_now || !sk_can_gso(sk) ||
 	    skb->ip_summed == CHECKSUM_NONE) {
 		/* Avoid the costly divide in the normal
 		 * non-TSO case.
@@ -940,7 +944,7 @@ static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
 		skb_shinfo(skb)->gso_size = 0;
 		skb_shinfo(skb)->gso_type = 0;
 	} else {
-		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now);
+		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(len, mss_now);
 		skb_shinfo(skb)->gso_size = mss_now;
 		skb_shinfo(skb)->gso_type = sk->sk_gso_type;
 	}
@@ -1125,15 +1129,20 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
 	skb->len = skb->data_len;
 }
 
-/* Remove acked data from a packet in the transmit queue. */
-int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+/* Remove acked data from a packet in the transmit queue.
+ * Ony called before retransmit.
+ */
+static int tcp_trim_head(struct sock *sk, struct sk_buff *skb)
 {
+	u32 len = tcp_sk(sk)->snd_una - TCP_SKB_CB(skb)->seq;
+
 	if (skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
 		return -ENOMEM;
 
 	__pskb_trim_head(skb, len);
 
 	TCP_SKB_CB(skb)->seq += len;
+	TCP_SKB_CB(skb)->header.offset_ack = 0;
 	skb->ip_summed = CHECKSUM_PARTIAL;
 
 	skb->truesize -= len;
@@ -2096,7 +2105,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
 		if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
 			BUG();
-		if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
+		if (tcp_trim_head(sk, skb))
 			return -ENOMEM;
 	}