tcp: Reallocate headroom if it would overflow csum_start

Message ID c35c262084e8907098dc2db5ea9690d2119b4916.1365678820.git.tgraf@suug.ch
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Thomas Graf April 11, 2013, 11:19 a.m. UTC
If a TCP retransmission gets partially ACKed and collapsed multiple
times, the headroom can grow beyond 64K. This overflows the 16-bit
skb->csum_start, which is measured from the start of the headroom.
It has been observed rarely in the wild with IPoIB due to the 64K MTU.

Check whether acking and collapsing have grown the headroom beyond
what csum_start can cover and, if so, reallocate the headroom.

LLNL has been running the patch for a while and has not seen the
problem occur since.

A big thank you to Jim Foraker <foraker1@llnl.gov> and the team at
LLNL for helping out with the investigation and testing.

Reported-by: Jim Foraker <foraker1@llnl.gov>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v2: reallocate headroom instead of preventing further collapsing

 net/ipv4/tcp_output.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
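
The overflow described in the commit message can be illustrated with a small user-space sketch. It is not kernel code: `struct toy_skb`, `toy_set_csum_start()` and `toy_needs_realloc()` are hypothetical names, and the model is simplified so that the 16-bit offset is taken directly from the headroom size. Only the `>= 0xFFFF` threshold is taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two values that matter here;
 * this is NOT the real struct sk_buff.
 */
struct toy_skb {
	size_t headroom;	/* bytes between buffer start and data */
	uint16_t csum_start;	/* 16-bit offset from the buffer start */
};

/* Simplified model: store the headroom as the checksum start offset.
 * The conversion to uint16_t silently drops the upper bits once the
 * headroom no longer fits in 16 bits.
 */
static void toy_set_csum_start(struct toy_skb *skb)
{
	skb->csum_start = (uint16_t)skb->headroom;
}

/* Same threshold as the check added by the patch: a headroom this
 * large cannot be covered by a 16-bit offset, so the buffer would
 * have to be copied into one with a fresh, small headroom.
 */
static int toy_needs_realloc(const struct toy_skb *skb)
{
	return skb->headroom >= 0xFFFF;
}
```

In this model a headroom of 70000 bytes is stored as 4464 (70000 - 65536), so a checksum computed from that offset would start in the wrong place, while `toy_needs_realloc()` flags the buffer before that can happen.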

Comments

Sergei Shtylyov April 11, 2013, 12:41 p.m. UTC | #1
Hello.

On 11-04-2013 15:19, Thomas Graf wrote:

> If a TCP retransmission gets partially ACKed and collapsed multiple
> times it is possible for the headroom to grow beyond 64K which will
> overflow the 16bit skb->csum_start which is based on the start of
> the headroom. It has been observed rarely in the wild with IPoIB due
> to the 64K MTU.

> Verify if the acking and collapsing resulted in a headroom exceeding
> what csum_start can cover and reallocate the headroom if so.

> LLNL has been running the patch for a while and has not seen the
> problem occur since.

> A big thank you to Jim Foraker <foraker1@llnl.gov> and the team at
> LLNL for helping out with the investigation and testing.

> Reported-by: Jim Foraker <foraker1@llnl.gov>
> Signed-off-by: Thomas Graf <tgraf@suug.ch>
[...]

    Minor formatting nit.

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index b44cf81..bf6ceb7 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2388,8 +2388,11 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
>   	 */
>   	TCP_SKB_CB(skb)->when = tcp_time_stamp;
>
> -	/* make sure skb->data is aligned on arches that require it */
> -	if (unlikely(NET_IP_ALIGN && ((unsigned long)skb->data & 3))) {
> +	/* make sure skb->data is aligned on arches that require it
> +	 * and check if ack-trimming & collapsing extended the headroom
> +	 * beyond what csum_start can cover. */

    The preferred multi-line comment style in the networking code:

/* bla
 * bla
 */

WBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 11, 2013, 3:49 p.m. UTC | #2
On Thu, 2013-04-11 at 13:19 +0200, Thomas Graf wrote:
> If a TCP retransmission gets partially ACKed and collapsed multiple
> times it is possible for the headroom to grow beyond 64K which will
> overflow the 16bit skb->csum_start which is based on the start of
> the headroom. It has been observed rarely in the wild with IPoIB due
> to the 64K MTU.
> 
> Verify if the acking and collapsing resulted in a headroom exceeding
> what csum_start can cover and reallocate the headroom if so.
> 
> LLNL has been running the patch for a while and has not seen the
> problem occur since.
> 
> A big thank you to Jim Foraker <foraker1@llnl.gov> and the team at
> LLNL for helping out with the investigation and testing.
> 
> Reported-by: Jim Foraker <foraker1@llnl.gov>
> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> ---
> v2: reallocate headroom instead of preventing further collapsing
> 
>  net/ipv4/tcp_output.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index b44cf81..bf6ceb7 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2388,8 +2388,11 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
>  	 */
>  	TCP_SKB_CB(skb)->when = tcp_time_stamp;
>  
> -	/* make sure skb->data is aligned on arches that require it */
> -	if (unlikely(NET_IP_ALIGN && ((unsigned long)skb->data & 3))) {
> +	/* make sure skb->data is aligned on arches that require it
> +	 * and check if ack-trimming & collapsing extended the headroom
> +	 * beyond what csum_start can cover. */
> +	if (unlikely(NET_IP_ALIGN && ((unsigned long)skb->data & 3) ||
> +		     skb_headroom(skb) >= 0xFFFF)) {
>  		struct sk_buff *nskb = __pskb_copy(skb, MAX_TCP_HEADER,
>  						   GFP_ATOMIC);
>  		return nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :

Strange... It was tested on an arch with NET_IP_ALIGN == 2, I presume?

This fix should also be done for other arches (x86 for example).

I would code the condition like this instead:

if ((NET_IP_ALIGN && ((unsigned long)skb->data & 3)) ||
    skb_headroom(skb) >= 0xFFFF)



Ben Hutchings April 11, 2013, 5:52 p.m. UTC | #3
On Thu, 2013-04-11 at 08:49 -0700, Eric Dumazet wrote:
> On Thu, 2013-04-11 at 13:19 +0200, Thomas Graf wrote:
> > If a TCP retransmission gets partially ACKed and collapsed multiple
> > times it is possible for the headroom to grow beyond 64K which will
> > overflow the 16bit skb->csum_start which is based on the start of
> > the headroom. It has been observed rarely in the wild with IPoIB due
> > to the 64K MTU.
> > 
> > Verify if the acking and collapsing resulted in a headroom exceeding
> > what csum_start can cover and reallocate the headroom if so.
> > 
> > LLNL has been running the patch for a while and has not seen the
> > problem occur since.
> > 
> > A big thank you to Jim Foraker <foraker1@llnl.gov> and the team at
> > LLNL for helping out with the investigation and testing.
> > 
> > Reported-by: Jim Foraker <foraker1@llnl.gov>
> > Signed-off-by: Thomas Graf <tgraf@suug.ch>
> > ---
> > v2: reallocate headroom instead of preventing further collapsing
> > 
> >  net/ipv4/tcp_output.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index b44cf81..bf6ceb7 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2388,8 +2388,11 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
> >  	 */
> >  	TCP_SKB_CB(skb)->when = tcp_time_stamp;
> >  
> > -	/* make sure skb->data is aligned on arches that require it */
> > -	if (unlikely(NET_IP_ALIGN && ((unsigned long)skb->data & 3))) {
> > +	/* make sure skb->data is aligned on arches that require it
> > +	 * and check if ack-trimming & collapsing extended the headroom
> > +	 * beyond what csum_start can cover. */
> > +	if (unlikely(NET_IP_ALIGN && ((unsigned long)skb->data & 3) ||
> > +		     skb_headroom(skb) >= 0xFFFF)) {
> >  		struct sk_buff *nskb = __pskb_copy(skb, MAX_TCP_HEADER,
> >  						   GFP_ATOMIC);
> >  		return nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :
> 
> Strange... It was tested on an arch with NET_IP_ALIGN == 2 I presume ?
> 
> This fix should also be done for other arches (x86 for example)
> 
> I would code the condition like that instead
> 
> if ((NET_IP_ALIGN && ((unsigned long)skb->data & 3)) ||
>     skb_headroom(skb) >= 0xFFFF)

You dropped the unlikely() and added redundant parentheses, which may be
clearer but is still equivalent.

Ben.
Eric Dumazet April 11, 2013, 5:57 p.m. UTC | #4
On Thu, 2013-04-11 at 18:52 +0100, Ben Hutchings wrote:
> On Thu, 2013-04-11 at 08:49 -0700, Eric Dumazet wrote:
> > On Thu, 2013-04-11 at 13:19 +0200, Thomas Graf wrote:
> > > If a TCP retransmission gets partially ACKed and collapsed multiple
> > > times it is possible for the headroom to grow beyond 64K which will
> > > overflow the 16bit skb->csum_start which is based on the start of
> > > the headroom. It has been observed rarely in the wild with IPoIB due
> > > to the 64K MTU.
> > > 
> > > Verify if the acking and collapsing resulted in a headroom exceeding
> > > what csum_start can cover and reallocate the headroom if so.
> > > 
> > > LLNL has been running the patch for a while and has not seen the
> > > problem occur since.
> > > 
> > > A big thank you to Jim Foraker <foraker1@llnl.gov> and the team at
> > > LLNL for helping out with the investigation and testing.
> > > 
> > > Reported-by: Jim Foraker <foraker1@llnl.gov>
> > > Signed-off-by: Thomas Graf <tgraf@suug.ch>
> > > ---
> > > v2: reallocate headroom instead of preventing further collapsing
> > > 
> > >  net/ipv4/tcp_output.c | 7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index b44cf81..bf6ceb7 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -2388,8 +2388,11 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
> > >  	 */
> > >  	TCP_SKB_CB(skb)->when = tcp_time_stamp;
> > >  
> > > -	/* make sure skb->data is aligned on arches that require it */
> > > -	if (unlikely(NET_IP_ALIGN && ((unsigned long)skb->data & 3))) {
> > > +	/* make sure skb->data is aligned on arches that require it
> > > +	 * and check if ack-trimming & collapsing extended the headroom
> > > +	 * beyond what csum_start can cover. */
> > > +	if (unlikely(NET_IP_ALIGN && ((unsigned long)skb->data & 3) ||
> > > +		     skb_headroom(skb) >= 0xFFFF)) {
> > >  		struct sk_buff *nskb = __pskb_copy(skb, MAX_TCP_HEADER,
> > >  						   GFP_ATOMIC);
> > >  		return nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :
> > 
> > Strange... It was tested on an arch with NET_IP_ALIGN == 2 I presume ?
> > 
> > This fix should also be done for other arches (x86 for example)
> > 
> > I would code the condition like that instead
> > 
> > if ((NET_IP_ALIGN && ((unsigned long)skb->data & 3)) ||
> >     skb_headroom(skb) >= 0xFFFF)
> 
> You dropped the unlikely() and added redundant parentheses, which may be
> clearer but is still equivalent.

I see what you mean...

I just don't like

if (A && B || C)

I prefer in this case

if ((A && B) || C)

Then add the unlikely() if we really care in this _ultra_ slow path

if (unlikely((A && B) || C)) 
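
The equivalence discussed in this exchange can be checked with a small user-space sketch. The function names are hypothetical; `a`, `b` and `c` stand in for `NET_IP_ALIGN` being non-zero, `skb->data` being misaligned, and `skb_headroom(skb) >= 0xFFFF` respectively. Some compilers may warn about the unparenthesized form (for example via `-Wparentheses`), which is an argument for the explicit grouping.

```c
#include <stdbool.h>

/* Condition shaped like the posted patch: && binds tighter than ||. */
static bool cond_as_posted(bool a, bool b, bool c)
{
	return a && b || c;
}

/* Condition with the explicit grouping suggested in the review. */
static bool cond_parenthesized(bool a, bool b, bool c)
{
	return (a && b) || c;
}

/* Compare both forms over all eight input combinations. */
static bool forms_agree(void)
{
	for (int i = 0; i < 8; i++) {
		bool a = i & 1, b = i & 2, c = i & 4;

		if (cond_as_posted(a, b, c) != cond_parenthesized(a, b, c))
			return false;
	}
	return true;
}
```

Because both forms agree for every input, the headroom check still fires when `a` is false, which is the case on an architecture where `NET_IP_ALIGN` is 0 (assumed here to include x86, as the discussion implies).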



Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b44cf81..bf6ceb7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2388,8 +2388,11 @@  int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	 */
 	TCP_SKB_CB(skb)->when = tcp_time_stamp;
 
-	/* make sure skb->data is aligned on arches that require it */
-	if (unlikely(NET_IP_ALIGN && ((unsigned long)skb->data & 3))) {
+	/* make sure skb->data is aligned on arches that require it
+	 * and check if ack-trimming & collapsing extended the headroom
+	 * beyond what csum_start can cover. */
+	if (unlikely(NET_IP_ALIGN && ((unsigned long)skb->data & 3) ||
+		     skb_headroom(skb) >= 0xFFFF)) {
 		struct sk_buff *nskb = __pskb_copy(skb, MAX_TCP_HEADER,
 						   GFP_ATOMIC);
 		return nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :