
[net,1/2] tcp: Limit number of segments generated by GSO per skb

Message ID: 1343668602.2667.6.camel@bwh-desktop.uk.solarflarecom.com
State: Changes Requested, archived
Delegated to: David Miller

Commit Message

Ben Hutchings July 30, 2012, 5:16 p.m. UTC
A peer (or local user) may cause TCP to use a nominal MSS of as little
as 88 (actual MSS of 76 with timestamps).  Given that we have a
sufficiently prodigious local sender and the peer ACKs quickly enough,
it is nevertheless possible to grow the window for such a connection
to the point that we will try to send just under 64K at once.  This
results in a single skb that expands to 861 segments.

In some drivers with TSO support, such an skb will require hundreds of
DMA descriptors; a substantial fraction of a TX ring or even more than
a full ring.  The TX queue selected for the skb may stall and trigger
the TX watchdog repeatedly (since the problem skb will be retried
after the TX reset).  This particularly affects sfc, for which the
issue is designated as CVE-2012-3412.  However it may be that some
hardware or firmware also fails to handle such an extreme TSO request
correctly.

Therefore, limit the number of segments per skb to 100.  This should
make no difference to behaviour unless the actual MSS is less than
about 700.
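
For reference, the figures above can be sanity-checked with some quick
arithmetic (a standalone sketch; the exact payload size is an assumed
value, not taken from the patch):

    #include <stdio.h>

    int main(void)
    {
            unsigned int mss = 76;          /* actual MSS with timestamps */
            unsigned int payload = 65424;   /* "just under 64K" (assumed) */

            /* A single maximal skb fans out into payload / mss segments. */
            printf("GSO segments: %u\n", (payload + mss - 1) / mss); /* ~861 */

            /* A 100-segment cap only bites once 100 * mss < 64K, i.e. when
             * the actual MSS falls below roughly 655-700 bytes.
             */
            printf("MSS threshold: %u\n", 65536 / 100);              /* 655 */
            return 0;
    }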

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 include/net/tcp.h     |    3 +++
 net/ipv4/tcp.c        |    4 +++-
 net/ipv4/tcp_output.c |   17 +++++++++--------
 3 files changed, 15 insertions(+), 9 deletions(-)

Comments

Ben Greear July 30, 2012, 5:23 p.m. UTC | #1
On 07/30/2012 10:16 AM, Ben Hutchings wrote:
> A peer (or local user) may cause TCP to use a nominal MSS of as little
> as 88 (actual MSS of 76 with timestamps).  Given that we have a
> sufficiently prodigious local sender and the peer ACKs quickly enough,
> it is nevertheless possible to grow the window for such a connection
> to the point that we will try to send just under 64K at once.  This
> results in a single skb that expands to 861 segments.
>
> In some drivers with TSO support, such an skb will require hundreds of
> DMA descriptors; a substantial fraction of a TX ring or even more than
> a full ring.  The TX queue selected for the skb may stall and trigger
> the TX watchdog repeatedly (since the problem skb will be retried
> after the TX reset).  This particularly affects sfc, for which the
> issue is designated as CVE-2012-3412.  However it may be that some
> hardware or firmware also fails to handle such an extreme TSO request
> correctly.
>
> Therefore, limit the number of segments per skb to 100.  This should
> make no difference to behaviour unless the actual MSS is less than
> about 700.

Please do not do this... or at least allow overrides.  We love
the trick of setting a very small MSS and making the NICs generate
huge numbers of small TCP frames with efficient user-space
logic.  We use this for stateful TCP load testing when a high
rate of TCP packets per second is desired.
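
For context, the small-MSS trick described here is typically set up with
the standard TCP_MAXSEG socket option; a minimal sketch (error handling
omitted, and the MSS value is just illustrative):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Clamp the MSS on a socket before connecting, so that TSO/GSO
     * fans each bulk write out into many small frames.
     */
    int make_small_mss_socket(void)
    {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            int mss = 88;   /* nominal MSS; the kernel may clamp it */

            if (fd >= 0)
                    setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG,
                               &mss, sizeof(mss));
            return fd;      /* connect() and send as usual */
    }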

Intel NICs, including 10G, work just fine with minimal MSS
in this scenario.

Thanks,
Ben
Eric Dumazet July 30, 2012, 5:31 p.m. UTC | #2
On Mon, 2012-07-30 at 18:16 +0100, Ben Hutchings wrote:
> A peer (or local user) may cause TCP to use a nominal MSS of as little
> as 88 (actual MSS of 76 with timestamps).  Given that we have a
> sufficiently prodigious local sender and the peer ACKs quickly enough,
> it is nevertheless possible to grow the window for such a connection
> to the point that we will try to send just under 64K at once.  This
> results in a single skb that expands to 861 segments.
> 
> In some drivers with TSO support, such an skb will require hundreds of
> DMA descriptors; a substantial fraction of a TX ring or even more than
> a full ring.  The TX queue selected for the skb may stall and trigger
> the TX watchdog repeatedly (since the problem skb will be retried
> after the TX reset).  This particularly affects sfc, for which the
> issue is designated as CVE-2012-3412.  However it may be that some
> hardware or firmware also fails to handle such an extreme TSO request
> correctly.
> 
> Therefore, limit the number of segments per skb to 100.  This should
> make no difference to behaviour unless the actual MSS is less than
> about 700.
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> ---


Hmm, isn't the GRO path also vulnerable?

An alternative would be to drop such frames in ndo_start_xmit(), and
cap sk->sk_gso_max_size (since skbs are no longer orphaned...)

Or you could introduce a new sk->sk_gso_max_segments that your sfc
driver sets to whatever limit it needs?
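
A rough sketch of that last suggestion, with stand-in types so it
compiles on its own (hypothetical names; not necessarily what would be
merged):

    /* Stand-ins for struct net_device and struct sock. */
    struct net_device_lim {
            unsigned int gso_max_segs;      /* driver-chosen limit, e.g. sfc */
    };

    struct sock_lim {
            unsigned int sk_gso_max_segments;
    };

    /* When a socket is bound to an output device, TCP would inherit the
     * device's limit instead of one hard-coded TCP_MAX_GSO_SEGS, so only
     * hardware that needs a cap pays for it.
     */
    static inline void sk_inherit_gso_limit(struct sock_lim *sk,
                                            const struct net_device_lim *dev)
    {
            sk->sk_gso_max_segments = dev->gso_max_segs;
    }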


Ben Hutchings July 30, 2012, 7:41 p.m. UTC | #3
On Mon, 2012-07-30 at 10:23 -0700, Ben Greear wrote:
> On 07/30/2012 10:16 AM, Ben Hutchings wrote:
> > A peer (or local user) may cause TCP to use a nominal MSS of as little
> > as 88 (actual MSS of 76 with timestamps).  Given that we have a
> > sufficiently prodigious local sender and the peer ACKs quickly enough,
> > it is nevertheless possible to grow the window for such a connection
> > to the point that we will try to send just under 64K at once.  This
> > results in a single skb that expands to 861 segments.
> >
> > In some drivers with TSO support, such an skb will require hundreds of
> > DMA descriptors; a substantial fraction of a TX ring or even more than
> > a full ring.  The TX queue selected for the skb may stall and trigger
> > the TX watchdog repeatedly (since the problem skb will be retried
> > after the TX reset).  This particularly affects sfc, for which the
> > issue is designated as CVE-2012-3412.  However it may be that some
> > hardware or firmware also fails to handle such an extreme TSO request
> > correctly.
> >
> > Therefore, limit the number of segments per skb to 100.  This should
> > make no difference to behaviour unless the actual MSS is less than
> > about 700.
> 
> Please do not do this... or at least allow overrides.  We love
> the trick of setting a very small MSS and making the NICs generate
> huge numbers of small TCP frames with efficient user-space
> logic.  We use this for stateful TCP load testing when a high
> rate of TCP packets per second is desired.

Please test whether this actually makes a difference - my suspicion is
that 100 segments per skb is easily enough to prevent the host being a
bottleneck.

> Intel NICs, including 10G, work just fine with minimal MSS
> in this scenario.

I'll leave this to the Intel maintainers to answer.

Ben.
Ben Greear July 30, 2012, 9 p.m. UTC | #4
On 07/30/2012 12:41 PM, Ben Hutchings wrote:
> On Mon, 2012-07-30 at 10:23 -0700, Ben Greear wrote:
>> On 07/30/2012 10:16 AM, Ben Hutchings wrote:
>>> A peer (or local user) may cause TCP to use a nominal MSS of as little
>>> as 88 (actual MSS of 76 with timestamps).  Given that we have a
>>> sufficiently prodigious local sender and the peer ACKs quickly enough,
>>> it is nevertheless possible to grow the window for such a connection
>>> to the point that we will try to send just under 64K at once.  This
>>> results in a single skb that expands to 861 segments.
>>>
>>> In some drivers with TSO support, such an skb will require hundreds of
>>> DMA descriptors; a substantial fraction of a TX ring or even more than
>>> a full ring.  The TX queue selected for the skb may stall and trigger
>>> the TX watchdog repeatedly (since the problem skb will be retried
>>> after the TX reset).  This particularly affects sfc, for which the
>>> issue is designated as CVE-2012-3412.  However it may be that some
>>> hardware or firmware also fails to handle such an extreme TSO request
>>> correctly.
>>>
>>> Therefore, limit the number of segments per skb to 100.  This should
>>> make no difference to behaviour unless the actual MSS is less than
>>> about 700.
>>
>> Please do not do this... or at least allow overrides.  We love
>> the trick of setting a very small MSS and making the NICs generate
>> huge numbers of small TCP frames with efficient user-space
>> logic.  We use this for stateful TCP load testing when a high
>> rate of TCP packets per second is desired.
>
> Please test whether this actually makes a difference - my suspicion is
> that 100 segments per skb is easily enough to prevent the host being a
> bottleneck.

Any CPU I can save I can use for other tasks.  If we can use the
NIC's offload features to segment packets, then we get a near-linear
increase in packets per second by adding NICs... at least up to
whatever the total bandwidth of the system is.

If you want to have the OS default to a safe value, that is
fine by me... but please give us a tunable so that we can get
the old behaviour back.

It's always possible I'm not the only one using this,
and I think it would be considered bad form to break
existing features and provide no workaround.
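
As a sketch of the kind of override being asked for here, a sysctl-style
cap might look like this (hypothetical; the patch as posted hard-codes
TCP_MAX_GSO_SEGS instead):

    /* Hypothetical tunable; 0 would restore the old uncapped behaviour. */
    static unsigned int sysctl_tcp_max_gso_segs = 100;

    static unsigned int tcp_gso_seg_cap(unsigned int cwnd_quota)
    {
            unsigned int cap = sysctl_tcp_max_gso_segs;

            if (!cap)               /* no cap: cwnd alone limits the skb */
                    return cwnd_quota;
            return cwnd_quota < cap ? cwnd_quota : cap;
    }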

Thanks,
Ben

>
>> Intel NICs, including 10G, work just fine with minimal MSS
>> in this scenario.
>
> I'll leave this to the Intel maintainers to answer.
>
> Ben.
>

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e19124b..098a2d0 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -70,6 +70,9 @@  extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* The least MTU to use for probing */
 #define TCP_BASE_MSS		512
 
+/* Maximum number of segments we may require GSO to generate from an skb. */
+#define TCP_MAX_GSO_SEGS	100
+
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e7e6eea..51d8daf 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -811,7 +811,9 @@  static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 			   old_size_goal + mss_now > xmit_size_goal)) {
 			xmit_size_goal = old_size_goal;
 		} else {
-			tp->xmit_size_goal_segs = xmit_size_goal / mss_now;
+			tp->xmit_size_goal_segs =
+				min_t(u32, xmit_size_goal / mss_now,
+				      TCP_MAX_GSO_SEGS);
 			xmit_size_goal = tp->xmit_size_goal_segs * mss_now;
 		}
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 33cd065..c86c288 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1522,21 +1522,21 @@  static void tcp_cwnd_validate(struct sock *sk)
  * when we would be allowed to send the split-due-to-Nagle skb fully.
  */
 static unsigned int tcp_mss_split_point(const struct sock *sk, const struct sk_buff *skb,
-					unsigned int mss_now, unsigned int cwnd)
+					unsigned int mss_now, unsigned int max_segs)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
-	u32 needed, window, cwnd_len;
+	u32 needed, window, max_len;
 
 	window = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;
-	cwnd_len = mss_now * cwnd;
+	max_len = mss_now * max_segs;
 
-	if (likely(cwnd_len <= window && skb != tcp_write_queue_tail(sk)))
-		return cwnd_len;
+	if (likely(max_len <= window && skb != tcp_write_queue_tail(sk)))
+		return max_len;
 
 	needed = min(skb->len, window);
 
-	if (cwnd_len <= needed)
-		return cwnd_len;
+	if (max_len <= needed)
+		return max_len;
 
 	return needed - needed % mss_now;
 }
@@ -1999,7 +1999,8 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		limit = mss_now;
 		if (tso_segs > 1 && !tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    cwnd_quota);
+						    min(cwnd_quota,
+							TCP_MAX_GSO_SEGS));
 
 		if (skb->len > limit &&
 		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))