| Message ID | 1343668602.2667.6.camel@bwh-desktop.uk.solarflarecom.com |
|---|---|
| State | Changes Requested, archived |
| Delegated to | David Miller |
On 07/30/2012 10:16 AM, Ben Hutchings wrote:
> A peer (or local user) may cause TCP to use a nominal MSS of as little
> as 88 (actual MSS of 76 with timestamps). Given that we have a
> sufficiently prodigious local sender and the peer ACKs quickly enough,
> it is nevertheless possible to grow the window for such a connection
> to the point that we will try to send just under 64K at once. This
> results in a single skb that expands to 861 segments.
>
> In some drivers with TSO support, such an skb will require hundreds of
> DMA descriptors; a substantial fraction of a TX ring or even more than
> a full ring. The TX queue selected for the skb may stall and trigger
> the TX watchdog repeatedly (since the problem skb will be retried
> after the TX reset). This particularly affects sfc, for which the
> issue is designated as CVE-2012-3412. However it may be that some
> hardware or firmware also fails to handle such an extreme TSO request
> correctly.
>
> Therefore, limit the number of segments per skb to 100. This should
> make no difference to behaviour unless the actual MSS is less than
> about 700.

Please do not do this... or at least allow overrides. We love the trick of setting a very small MSS and making the NICs generate huge numbers of small TCP frames with efficient user-space logic. We use this for stateful TCP load testing when a high number of TCP packets-per-second is desired.

Intel NICs, including 10G, work just fine with a minimal MSS in this scenario.

Thanks,
Ben
On Mon, 2012-07-30 at 18:16 +0100, Ben Hutchings wrote:
> A peer (or local user) may cause TCP to use a nominal MSS of as little
> as 88 (actual MSS of 76 with timestamps). Given that we have a
> sufficiently prodigious local sender and the peer ACKs quickly enough,
> it is nevertheless possible to grow the window for such a connection
> to the point that we will try to send just under 64K at once. This
> results in a single skb that expands to 861 segments.
>
> In some drivers with TSO support, such an skb will require hundreds of
> DMA descriptors; a substantial fraction of a TX ring or even more than
> a full ring. The TX queue selected for the skb may stall and trigger
> the TX watchdog repeatedly (since the problem skb will be retried
> after the TX reset). This particularly affects sfc, for which the
> issue is designated as CVE-2012-3412. However it may be that some
> hardware or firmware also fails to handle such an extreme TSO request
> correctly.
>
> Therefore, limit the number of segments per skb to 100. This should
> make no difference to behaviour unless the actual MSS is less than
> about 700.
>
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> ---

Hmm, isn't the GRO path also vulnerable?

An alternative would be to drop such frames in the ndo_start_xmit(), and cap sk->sk_gso_max_size (since skbs are no longer orphaned...)

Or you could introduce a new sk->sk_gso_max_segments, that your sfc driver sets to whatever limit?
On Mon, 2012-07-30 at 10:23 -0700, Ben Greear wrote:
> On 07/30/2012 10:16 AM, Ben Hutchings wrote:
> > A peer (or local user) may cause TCP to use a nominal MSS of as little
> > as 88 (actual MSS of 76 with timestamps). Given that we have a
> > sufficiently prodigious local sender and the peer ACKs quickly enough,
> > it is nevertheless possible to grow the window for such a connection
> > to the point that we will try to send just under 64K at once. This
> > results in a single skb that expands to 861 segments.
> >
> > In some drivers with TSO support, such an skb will require hundreds of
> > DMA descriptors; a substantial fraction of a TX ring or even more than
> > a full ring. The TX queue selected for the skb may stall and trigger
> > the TX watchdog repeatedly (since the problem skb will be retried
> > after the TX reset). This particularly affects sfc, for which the
> > issue is designated as CVE-2012-3412. However it may be that some
> > hardware or firmware also fails to handle such an extreme TSO request
> > correctly.
> >
> > Therefore, limit the number of segments per skb to 100. This should
> > make no difference to behaviour unless the actual MSS is less than
> > about 700.
>
> Please do not do this... or at least allow overrides. We love
> the trick of setting very small MSS and making the NICs generate
> huge numbers of small TCP frames with efficient user-space
> logic. We use this for stateful TCP load testing when high
> numbers of TCP packets-per-second is desired.

Please test whether this actually makes a difference - my suspicion is that 100 segments per skb is easily enough to prevent the host being a bottleneck.

> Intel NICs, including 10G, work just fine with minimal MSS
> in this scenario.

I'll leave this to the Intel maintainers to answer.

Ben.
On 07/30/2012 12:41 PM, Ben Hutchings wrote:
> On Mon, 2012-07-30 at 10:23 -0700, Ben Greear wrote:
>> On 07/30/2012 10:16 AM, Ben Hutchings wrote:
>>> A peer (or local user) may cause TCP to use a nominal MSS of as little
>>> as 88 (actual MSS of 76 with timestamps). Given that we have a
>>> sufficiently prodigious local sender and the peer ACKs quickly enough,
>>> it is nevertheless possible to grow the window for such a connection
>>> to the point that we will try to send just under 64K at once. This
>>> results in a single skb that expands to 861 segments.
>>>
>>> In some drivers with TSO support, such an skb will require hundreds of
>>> DMA descriptors; a substantial fraction of a TX ring or even more than
>>> a full ring. The TX queue selected for the skb may stall and trigger
>>> the TX watchdog repeatedly (since the problem skb will be retried
>>> after the TX reset). This particularly affects sfc, for which the
>>> issue is designated as CVE-2012-3412. However it may be that some
>>> hardware or firmware also fails to handle such an extreme TSO request
>>> correctly.
>>>
>>> Therefore, limit the number of segments per skb to 100. This should
>>> make no difference to behaviour unless the actual MSS is less than
>>> about 700.
>>
>> Please do not do this... or at least allow overrides. We love
>> the trick of setting very small MSS and making the NICs generate
>> huge numbers of small TCP frames with efficient user-space
>> logic. We use this for stateful TCP load testing when high
>> numbers of TCP packets-per-second is desired.
>
> Please test whether this actually makes a difference - my suspicion is
> that 100 segments per skb is easily enough to prevent the host being a
> bottleneck.

Any CPU I can save I can use for other tasks. If we can use the NIC's offload features to segment packets, then we get a near-linear increase in packets-per-second by adding NICs... at least up to whatever the total bandwidth of the system is.

If you want to have the OS default to a safe value, that is fine by me, but please give us a tunable so that we can get the old behaviour. It's always possible I'm not the only one using this, and I think it would be considered bad form to break existing features and provide no work-around.

Thanks,
Ben

>> Intel NICs, including 10G, work just fine with minimal MSS
>> in this scenario.
>
> I'll leave this to the Intel maintainers to answer.
>
> Ben.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e19124b..098a2d0 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -70,6 +70,9 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* The least MTU to use for probing */
 #define TCP_BASE_MSS		512
 
+/* Maximum number of segments we may require GSO to generate from an skb. */
+#define TCP_MAX_GSO_SEGS	100
+
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e7e6eea..51d8daf 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -811,7 +811,9 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 			   old_size_goal + mss_now > xmit_size_goal)) {
 			xmit_size_goal = old_size_goal;
 		} else {
-			tp->xmit_size_goal_segs = xmit_size_goal / mss_now;
+			tp->xmit_size_goal_segs =
+				min_t(u32, xmit_size_goal / mss_now,
+				      TCP_MAX_GSO_SEGS);
 			xmit_size_goal = tp->xmit_size_goal_segs * mss_now;
 		}
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 33cd065..c86c288 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1522,21 +1522,21 @@ static void tcp_cwnd_validate(struct sock *sk)
  * when we would be allowed to send the split-due-to-Nagle skb fully.
  */
 static unsigned int tcp_mss_split_point(const struct sock *sk,
 					const struct sk_buff *skb,
-					unsigned int mss_now, unsigned int cwnd)
+					unsigned int mss_now, unsigned int max_segs)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
-	u32 needed, window, cwnd_len;
+	u32 needed, window, max_len;
 
 	window = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;
-	cwnd_len = mss_now * cwnd;
+	max_len = mss_now * max_segs;
 
-	if (likely(cwnd_len <= window && skb != tcp_write_queue_tail(sk)))
-		return cwnd_len;
+	if (likely(max_len <= window && skb != tcp_write_queue_tail(sk)))
+		return max_len;
 
 	needed = min(skb->len, window);
 
-	if (cwnd_len <= needed)
-		return cwnd_len;
+	if (max_len <= needed)
+		return max_len;
 
 	return needed - needed % mss_now;
 }
@@ -1999,7 +1999,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		limit = mss_now;
 		if (tso_segs > 1 && !tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    cwnd_quota);
+						    min(cwnd_quota,
+							TCP_MAX_GSO_SEGS));
 
 		if (skb->len > limit &&
 		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
A peer (or local user) may cause TCP to use a nominal MSS of as little as 88 (actual MSS of 76 with timestamps). Given that we have a sufficiently prodigious local sender and the peer ACKs quickly enough, it is nevertheless possible to grow the window for such a connection to the point that we will try to send just under 64K at once. This results in a single skb that expands to 861 segments.

In some drivers with TSO support, such an skb will require hundreds of DMA descriptors; a substantial fraction of a TX ring or even more than a full ring. The TX queue selected for the skb may stall and trigger the TX watchdog repeatedly (since the problem skb will be retried after the TX reset). This particularly affects sfc, for which the issue is designated as CVE-2012-3412. However it may be that some hardware or firmware also fails to handle such an extreme TSO request correctly.

Therefore, limit the number of segments per skb to 100. This should make no difference to behaviour unless the actual MSS is less than about 700.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 include/net/tcp.h     |    3 +++
 net/ipv4/tcp.c        |    4 +++-
 net/ipv4/tcp_output.c |   17 +++++++++--------
 3 files changed, 15 insertions(+), 9 deletions(-)