
tcp: fix potential corner case issue in segmentation (Was: Re: [PATCH] Do not use TSO/GSO when there is urgent data)

Message ID Pine.LNX.4.64.0811211438420.3930@wrl-59.cs.helsinki.fi
State RFC, archived
Delegated to: David Miller

Commit Message

Ilpo Järvinen Nov. 21, 2008, 1:07 p.m. UTC
On Fri, 21 Nov 2008, Petr Tesarik wrote:

> This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=12014
> 
> Since most (if not all) implementations of TSO and even the in-kernel
> software GSO do not update the urgent pointer when splitting a large
> segment, it is necessary to turn off TSO/GSO for all outgoing traffic
> with the URG pointer set.

Good observation, I totally missed this possibility of T/GSO while 
looking.

> Looking at tcp_current_mss (and the preceding comment) I even think
> this was the original intention. However, this approach is insufficient,
> because TSO/GSO is turned off only for newly created frames, not for
> frames which were already pending at the arrival of a message with
> MSG_OOB set. These frames were created when TSO/GSO was enabled,
> so they may be large, and they will have the urgent pointer set
> in tcp_transmit_skb().
> 
> With this patch, such large packets will be fragmented again before
> going to the transmit routine.
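
To illustrate the failure mode concretely (a standalone userspace sketch
with hypothetical numbers, not kernel code): the urgent pointer is a
16-bit offset from each wire segment's own sequence number, so it has to
be recomputed for every segment a TSO/GSO engine derives from one large
skb. An engine that simply replicates the template header leaves every
segment after the first with an offset computed against the first
segment's seq:

#include <stdio.h>
#include <stdint.h>

/* Sketch only: how a header-copying TSO split corrupts urg_ptr. */
int main(void)
{
	uint32_t seq    = 1000;		/* seq of the large skb's first byte */
	uint32_t snd_up = 5000;		/* sequence the urgent pointer targets */
	uint32_t mss    = 1460;
	uint32_t len    = 3 * mss;	/* large enough to be split by TSO */

	for (uint32_t off = 0; off < len; off += mss) {
		uint32_t seg_seq = seq + off;
		uint16_t correct = (uint16_t)(snd_up - seg_seq); /* per segment */
		uint16_t copied  = (uint16_t)(snd_up - seq);	 /* template */
		printf("seg_seq=%u urg_ptr correct=%u header-copied=%u\n",
		       seg_seq, correct, copied);
	}
	return 0;
}

Every segment after the first points one or two MSS past the real urgent
byte, which matches the corruption reported for the NICs listed below.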

I wonder if there's some corner case which still fails to fragment
in tcp_retransmit_skb()'s skb->len <= cur_mss case if cur_mss
grew very recently (and therefore skb->len now fits into a single segment).

Patch below; the changed subject is due to patchwork, as requested by DaveM
(in case you find that strange).

> As a side note, at least the following NICs are known to screw up
> the urgent pointer in the TCP header when doing TSO:
> 
> 	Intel 82566MM (PCI ID 8086:1049)
> 	Intel 82566DC (PCI ID 8086:104b)
> 	Intel 82541GI (PCI ID 8086:1076)
> 	Broadcom NetXtreme II BCM5708 (PCI ID 14e4:164c)

Heh, it might already be longer than the list we would get from the
working ones. It's not very likely that many people take urg seriously
enough to get it right.

> Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
> CC: Jan Sembera <jsembera@suse.cz>
> CC: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi>
> 
> --
>  tcp_output.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -722,7 +722,8 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
>  static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb,
>  				 unsigned int mss_now)
>  {
> -	if (skb->len <= mss_now || !sk_can_gso(sk)) {
> +	if (skb->len <= mss_now || !sk_can_gso(sk) ||
> +	    tcp_urg_mode(tcp_sk(sk))) {
>  		/* Avoid the costly divide in the normal
>  		 * non-TSO case.
>  		 */
> @@ -1163,7 +1164,9 @@ static int tcp_init_tso_segs(struct sock *sk, struct sk_buff *skb,
>  {
>  	int tso_segs = tcp_skb_pcount(skb);
>  
> -	if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now)) {
> +	if (!tso_segs ||
> +	    (tso_segs > 1 && (tcp_skb_mss(skb) != mss_now ||
> +			      tcp_urg_mode(tcp_sk(sk))))) {
>  		tcp_set_skb_tso_segs(sk, skb, mss_now);
>  		tso_segs = tcp_skb_pcount(skb);
>  	}

It's a bit intrusive, but I couldn't immediately come up with an alternative 
that would have worked (I came up with some non-working ones :-)).

Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Comments

Petr Tesarik Nov. 21, 2008, 3:51 p.m. UTC | #1
On Friday, 21 November 2008 at 14:07:32, Ilpo Järvinen wrote:
> On Fri, 21 Nov 2008, Petr Tesarik wrote:
> > This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=12014
> >
> > Since most (if not all) implementations of TSO and even the in-kernel
> > software GSO do not update the urgent pointer when splitting a large
> > segment, it is necessary to turn off TSO/GSO for all outgoing traffic
> > with the URG pointer set.
>
> Good observation, I totally missed this possibility of T/GSO while
> looking.
>
> > Looking at tcp_current_mss (and the preceding comment) I even think
> > this was the original intention. However, this approach is insufficient,
> > because TSO/GSO is turned off only for newly created frames, not for
> > frames which were already pending at the arrival of a message with
> > MSG_OOB set. These frames were created when TSO/GSO was enabled,
> > so they may be large, and they will have the urgent pointer set
> > in tcp_transmit_skb().
> >
> > With this patch, such large packets will be fragmented again before
> > going to the transmit routine.
>
> I wonder if there's some corner case which still fails to fragment
> in tcp_retransmit_skb()'s skb->len <= cur_mss case if cur_mss
> grew very recently (and therefore skb->len now fits into a single segment).

This shouldn't be a problem, because TSO only applies to packets which are 
larger than the MSS, so the problematic case is when cur_mss gets smaller, not 
when it grows. In other words, the original implementation of 
tcp_retransmit_skb() could never make use of TSO/GSO anyway...

>[...]
> > Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
> > CC: Jan Sembera <jsembera@suse.cz>
> > CC: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi>
> >
> > --
> >  tcp_output.c |    7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -722,7 +722,8 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
> >  static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb,
> >  				 unsigned int mss_now)
> >  {
> > -	if (skb->len <= mss_now || !sk_can_gso(sk)) {
> > +	if (skb->len <= mss_now || !sk_can_gso(sk) ||
> > +	    tcp_urg_mode(tcp_sk(sk))) {
> >  		/* Avoid the costly divide in the normal
> >  		 * non-TSO case.
> >  		 */
> > @@ -1163,7 +1164,9 @@ static int tcp_init_tso_segs(struct sock *sk, struct sk_buff *skb,
> >  {
> >  	int tso_segs = tcp_skb_pcount(skb);
> >
> > -	if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now)) {
> > +	if (!tso_segs ||
> > +	    (tso_segs > 1 && (tcp_skb_mss(skb) != mss_now ||
> > +			      tcp_urg_mode(tcp_sk(sk))))) {
> >  		tcp_set_skb_tso_segs(sk, skb, mss_now);
> >  		tso_segs = tcp_skb_pcount(skb);
> >  	}
>
> It's a bit intrusive, but I couldn't immediately come up with an alternative
> that would have worked (I came up with some non-working ones :-)).

Yes, I also noticed that. We could add some more code to tcp_mark_urg(), e.g. 
walk sk_write_queue and adjust the pending SKBs there...

Is it OK to simply set all skb->gso_segs to zero and let the next call to 
tcp_init_tso_segs() redo them? I mean, will tcp_init_tso_segs() always be 
called on all SKBs which are in the write queue at the time tcp_mark_urg() is 
called?
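
A minimal sketch of that idea (hypothetical helper name, untested, written
against this era's write-queue API with tcp_for_write_queue(); note the
follow-up below on why the send-time initialization of these fields
defeats it):

/* Sketch only, as it would sit in net/ipv4/tcp_output.c: on entering
 * urgent mode, clear the cached segmentation of every queued skb so the
 * next tcp_init_tso_segs() call (!tso_segs forces the redo) re-runs
 * tcp_set_skb_tso_segs(), which now sees tcp_urg_mode() and marks the
 * skb non-TSO. */
static void tcp_urg_reset_pending_tso(struct sock *sk)
{
	struct sk_buff *skb;

	tcp_for_write_queue(skb, sk)
		skb_shinfo(skb)->gso_segs = 0;
}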

Petr
Ilpo Järvinen Nov. 21, 2008, 5:49 p.m. UTC | #2
On Fri, 21 Nov 2008, Petr Tesarik wrote:

> On Friday, 21 November 2008 at 14:07:32, Ilpo Järvinen wrote:
> > On Fri, 21 Nov 2008, Petr Tesarik wrote:
> > > This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=12014
> > >
> > > Since most (if not all) implementations of TSO and even the in-kernel
> > > software GSO do not update the urgent pointer when splitting a large
> > > segment, it is necessary to turn off TSO/GSO for all outgoing traffic
> > > with the URG pointer set.
> >
> > Good observation, I totally missed this possibility of T/GSO while
> > looking.
> >
> > > Looking at tcp_current_mss (and the preceding comment) I even think
> > > this was the original intention. However, this approach is insufficient,
> > > because TSO/GSO is turned off only for newly created frames, not for
> > > frames which were already pending at the arrival of a message with
> > > MSG_OOB set. These frames were created when TSO/GSO was enabled,
> > > so they may be large, and they will have the urgent pointer set
> > > in tcp_transmit_skb().
> > >
> > > With this patch, such large packets will be fragmented again before
> > > going to the transmit routine.
> >
> > I wonder if there's some corner case which still fails to fragment
> > in tcp_retransmit_skb()'s skb->len <= cur_mss case if cur_mss
> > grew very recently (and therefore skb->len now fits into a single segment).
> 
> This shouldn't be a problem, because TSO only applies to packets which are 
> larger than the MSS, so the problematic case is when cur_mss gets smaller, not 
> when it grows. In other words, the original implementation of 
> tcp_retransmit_skb() could never make use of TSO/GSO anyway...

I disagree:

1. mss == x
2. skb->len == 2*x => send using TSO/GSO
3. mss = 2*x (e.g., an MTU probe completes)
4. rexmit the skb: nothing resets the TSO/GSO fields even though now
skb->len <= mss, so it will get sent as two unnecessarily small packets
using TSO/GSO and will never reach the place your patch fixed.

Or did I miss something?
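
The same four steps with concrete (hypothetical) numbers, as a standalone
userspace sketch: the skb caches its segmentation at send time, and the
pre-patch retransmit path only refreshes it via tcp_fragment() when
skb->len > cur_mss, a test that stops firing once the MSS grows:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t mss     = 1460;		/* step 1 */
	uint32_t skb_len = 2 * mss;		/* step 2: sent as pcount=2 */
	uint32_t pcount  = skb_len / mss;	/* cached gso_segs */
	uint32_t skb_mss = mss;			/* cached gso_size */

	mss = 2 * 1460;				/* step 3: MTU probe completes */

	if (skb_len > mss)			/* step 4: no longer true */
		puts("tcp_fragment() would refresh the segmentation");
	else
		printf("rexmit reuses stale pcount=%u skb_mss=%u although "
		       "a single %u-byte segment now fits\n",
		       pcount, skb_mss, skb_len);
	return 0;
}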

> >[...]
> > > Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
> > > CC: Jan Sembera <jsembera@suse.cz>
> > > CC: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi>
> > >
> > > --
> > >  tcp_output.c |    7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -722,7 +722,8 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
> > >  static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb,
> > >  				 unsigned int mss_now)
> > >  {
> > > -	if (skb->len <= mss_now || !sk_can_gso(sk)) {
> > > +	if (skb->len <= mss_now || !sk_can_gso(sk) ||
> > > +	    tcp_urg_mode(tcp_sk(sk))) {
> > >  		/* Avoid the costly divide in the normal
> > >  		 * non-TSO case.
> > >  		 */
> > > @@ -1163,7 +1164,9 @@ static int tcp_init_tso_segs(struct sock *sk, struct sk_buff *skb,
> > >  {
> > >  	int tso_segs = tcp_skb_pcount(skb);
> > >
> > > -	if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now)) {
> > > +	if (!tso_segs ||
> > > +	    (tso_segs > 1 && (tcp_skb_mss(skb) != mss_now ||
> > > +			      tcp_urg_mode(tcp_sk(sk))))) {
> > >  		tcp_set_skb_tso_segs(sk, skb, mss_now);
> > >  		tso_segs = tcp_skb_pcount(skb);
> > >  	}
> >
> > It's a bit intrusive, but I couldn't immediately come up with an alternative
> > that would have worked (I came up with some non-working ones :-)).
> 
> Yes, I also noticed that. We could add some more code to tcp_mark_urg(), e.g. 
> walk sk_write_queue and adjust the pending SKBs there...
>
> Is it OK to simply set all skb->gso_segs to zero and let the next call to 
> tcp_init_tso_segs() redo them?

If we walk backwards, we could consider short-circuiting the walk at the 
16-bit urg field limit. I wouldn't mind if users of such an obscure feature 
paid the price, but the final decision is up to Dave, of course.

> I mean, will tcp_init_tso_segs() always be 
> called on all SKBs which are in the write queue at the time tcp_mark_urg() is 
> called?

I realized there is more breakage: the tcp_urg_mode() check in
tcp_current_mss() is too late as well; it won't take effect until we're
already past the relevant seqno range... It basically starts working from
snd_up upwards rather than on the snd_up-2^16..snd_up region.

In addition, it's quite impossible to TSO/GSO successfully past the
ceil_to_mss(snd_up - 2^16) boundary anyway, because we don't have enough 
bits in the urgent field to tell at which point the fields should get set 
(unless there were some out-of-band communication channel for the 
urgent sequence number).
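
A standalone sketch of that 16-bit limit (hypothetical sequence numbers):
once a segment's seq falls more than 2^16 bytes before snd_up, no value of
the urgent field can describe the offset at all:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t snd_up = 200000;
	uint32_t seqs[] = { 100000, 150000, 199000 };

	for (int i = 0; i < 3; i++) {
		uint32_t off = snd_up - seqs[i];
		if (off > 0xFFFF)
			printf("seq=%u: offset %u does not fit in 16 bits\n",
			       seqs[i], off);
		else
			printf("seq=%u: urg_ptr=%u is representable\n",
			       seqs[i], off);
	}
	return 0;
}
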
Ilpo Järvinen Nov. 21, 2008, 6:21 p.m. UTC | #3
On Fri, 21 Nov 2008, Ilpo Järvinen wrote:

> On Fri, 21 Nov 2008, Petr Tesarik wrote:
> 
> > > It's a bit intrusive, but I couldn't immediately come up with an alternative
> > > that would have worked (I came up with some non-working ones :-)).
> > 
> > Yes, I also noticed that. We could add some more code to tcp_mark_urg(), e.g. 
> > walk sk_write_queue and adjust the pending SKBs there...
> >
> > Is it OK to simply set all skb->gso_segs to zero and let the next call to 
> > tcp_init_tso_segs() redo them?
> 
> If we walk backwards, we could consider short-circuiting the walk at the 
> 16-bit urg field limit. I wouldn't mind if users of such an obscure feature 
> paid the price, but the final decision is up to Dave, of course.

On second thought, it won't work since those fields get initialized later, 
i.e., at send time, so it would undo the effort.

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a524627..129f46b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1944,6 +1944,8 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	if (skb->len > cur_mss) {
 		if (tcp_fragment(sk, skb, cur_mss, cur_mss))
 			return -ENOMEM; /* We'll try again later. */
+	} else {
+		tcp_init_tso_segs(sk, skb, cur_mss);
 	}
 
 	/* Collapse two adjacent packets if worthwhile and we can. */
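
The added else branch makes tcp_retransmit_skb() re-run tcp_init_tso_segs()
even when the skb already fits within cur_mss, so segmentation state cached
before an MSS increase (or before urgent mode was entered) is refreshed
instead of being replayed onto the wire, closing the corner case discussed
in the thread above.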