
TCP_DEFER_ACCEPT is missing counter update

Message ID 4AD7FE01.1010805@gmail.com
State Superseded, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Oct. 16, 2009, 5 a.m. UTC
Eric Dumazet wrote:
> 
> 
> So, it appears the defer_accept value is not an inherited attribute,
> but one shared by all embryons. Therefore we should not touch it.
> 
> Of course it should be done, or we could add a new connection field to count
> the number of pure ACKs received on each SYN_RECV embryon.
> 

Could it be something like this? (on top of net-next-2.6, of course)

7 bits is more than enough; we could even take 5 bits, IMHO.


Thanks

[PATCH net-next-2.6] tcp: TCP_DEFER_ACCEPT fix

Julian Anastasov pointed out that defer_accept is an attribute of the listening
socket, shared by all embryons. We therefore need a new per-connection attribute.

We can split the u8 currently used by cookie_ts into 7 bits storing the number
of pure ACKs received by the embryon, and 1 bit storing cookie_ts.

Note: I use an unsigned int bitfield so that kmemcheck does not complain; as mss
is now a bitfield member (whose address can no longer be taken), the patch also
touches the cookie_v4_init_sequence()/cookie_v6_init_sequence() implementations.
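
To illustrate why the signature change is needed (a minimal sketch, not the exact
tree contents): C does not allow taking the address of a bitfield member, so the
syncookie helpers must now receive the whole request_sock instead of &req->mss:

/* Before: mss was a plain u16, so callers could pass its address:
 *	isn = cookie_v4_init_sequence(sk, skb, &req->mss);
 */

/* After: mss lives in a kmemcheck-annotated bitfield, so &req->mss is no
 * longer valid C and the whole req is passed instead:
 *	isn = cookie_v4_init_sequence(sk, skb, req);
 */
struct request_sock_sketch {
	unsigned int	mss : 16,
			retrans : 8,
			req_acks : 7,
			cookie_ts : 1;
};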

Reported-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/request_sock.h |    9 ++++++---
 include/net/tcp.h          |    6 ++++--
 net/ipv4/syncookies.c      |    7 ++++---
 net/ipv4/tcp_ipv4.c        |    2 +-
 net/ipv4/tcp_minisocks.c   |    4 ++--
 net/ipv6/syncookies.c      |    6 +++---
 net/ipv6/tcp_ipv6.c        |    2 +-
 7 files changed, 21 insertions(+), 15 deletions(-)



Comments

Willy Tarreau Oct. 16, 2009, 5:29 a.m. UTC | #1
On Fri, Oct 16, 2009 at 07:00:49AM +0200, Eric Dumazet wrote:
> Eric Dumazet wrote:
> > 
> > 
> > So, it appears the defer_accept value is not an inherited attribute,
> > but one shared by all embryons. Therefore we should not touch it.
> > 
> > Of course it should be done, or we could add a new connection field to count
> > the number of pure ACKs received on each SYN_RECV embryon.
> > 
> 
> Could it be something like this? (on top of net-next-2.6, of course)
> 
> 7 bits is more than enough; we could even take 5 bits, IMHO.

Couldn't we just rely on the retrans vs rskq_defer_accept comparison?

Willy

Eric Dumazet Oct. 16, 2009, 6:05 a.m. UTC | #2
Willy Tarreau wrote:
> On Fri, Oct 16, 2009 at 07:00:49AM +0200, Eric Dumazet wrote:
>> Eric Dumazet wrote:
>>>
>>> So, it appears the defer_accept value is not an inherited attribute,
>>> but one shared by all embryons. Therefore we should not touch it.
>>>
>>> Of course it should be done, or we could add a new connection field to count
>>> the number of pure ACKs received on each SYN_RECV embryon.
>>>
>> Could it be something like this? (on top of net-next-2.6, of course)
>>
>> 7 bits is more than enough; we could even take 5 bits, IMHO.
> 
> Couldn't we just rely on the retrans vs rskq_defer_accept comparison?
> 

In this case, we lose the TCP_DEFER_ACCEPT advantage when a SYN-ACK is dropped
by the network: we wake up the listening server when the first ACK comes from the
client, instead of really waiting for the request.

I think being able to count pure ACKs would be slightly better, and it costs nothing.


retrans is the number of SYN-ACKs (re)sent for the SYN_RECV embryon, while req_acks
would count the number of pure ACKs received.

Those numbers should be related in an ideal world, but could they differ in the real world?
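
To make the difference concrete, here are the two candidate tests for the bare-ACK
case in tcp_check_req(), side by side. The req_acks version matches the hunk in the
patch; the retrans-based one is only my sketch of the alternative, not code from
this thread:

/* Per-request counting (this patch): keep dropping bare ACKs until this
 * embryon has seen rskq_defer_accept of them. */
if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept > req->req_acks &&
    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
	req->req_acks++;
	inet_rsk(req)->acked = 1;
	return NULL;	/* keep deferring, do not wake the listener */
}

/* retrans-based test (sketch of the alternative): stop deferring once the
 * SYN-ACK has been (re)sent rskq_defer_accept times, even if the bare ACK
 * only proves that one SYN-ACK finally got through. */
if (req->retrans < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
	inet_rsk(req)->acked = 1;
	return NULL;
}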

Willy Tarreau Oct. 16, 2009, 6:18 a.m. UTC | #3
On Fri, Oct 16, 2009 at 08:05:19AM +0200, Eric Dumazet wrote:
> > Couldn't we just rely on the retrans vs rskq_defer_accept comparison?
> > 
> 
> In this case, we lose the TCP_DEFER_ACCEPT advantage when a SYN-ACK is dropped
> by the network: we wake up the listening server when the first ACK comes from the
> client, instead of really waiting for the request.
> 
> I think being able to count pure ACKs would be slightly better, and it costs nothing.
> 
> 
> retrans is the number of SYN-ACKs (re)sent for the SYN_RECV embryon, while req_acks
> would count the number of pure ACKs received.
> 
> Those numbers should be related in an ideal world, but could they differ in the real world?

Yes, they could differ if a pure ACK is lost between the client and the server,
but in my opinion what matters is not to account precisely for the number of ACKs
so that we wake up exactly after XXX ACKs received, but that in most common
situations we avoid waking up too early.

Also, keep in mind that the TCP_DEFER_ACCEPT parameter is passed as a number
of seconds by the application, which is in turn converted to a number of
retransmits based on our own timer, which means that our SYN-ACK counter
is what most closely matches the application's expected delay, even if an
ACK from the client gets lost in between or if a client's stack retransmits
pure ACKs very fast for any implementation-specific reason.
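
Roughly, that conversion walks the exponential SYN-ACK backoff until the requested
delay is covered. A self-contained sketch of the idea (not the exact kernel helper,
and with an assumed 3 s initial SYN-ACK timeout):

/* How many SYN-ACK (re)transmissions cover "seconds" of deferral, given the
 * initial timeout and the RTO cap (all in seconds). */
static int defer_secs_to_retrans(int seconds, int init_timeout, int rto_max)
{
	int retrans = 0, timeout = init_timeout, elapsed = 0;

	while (seconds > 0 && elapsed < seconds) {
		elapsed += timeout;
		timeout = (2 * timeout > rto_max) ? rto_max : 2 * timeout;
		retrans++;
	}
	return retrans;
}

/* e.g. defer_secs_to_retrans(10, 3, 120) == 3: the 3s + 6s + 12s backoff
 * steps are what actually bound the application's requested delay. */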

Regards,
Willy

Eric Dumazet Oct. 16, 2009, 7:08 a.m. UTC | #4
Willy Tarreau wrote:
> On Fri, Oct 16, 2009 at 08:05:19AM +0200, Eric Dumazet wrote:
>>> Couldn't we just rely on the retrans vs rskq_defer_accept comparison?
>>>
>> In this case, we lose the TCP_DEFER_ACCEPT advantage when a SYN-ACK is dropped
>> by the network: we wake up the listening server when the first ACK comes from the
>> client, instead of really waiting for the request.
>>
>> I think being able to count pure ACKs would be slightly better, and it costs nothing.
>>
>>
>> retrans is the number of SYN-ACKs (re)sent for the SYN_RECV embryon, while req_acks
>> would count the number of pure ACKs received.
>>
>> Those numbers should be related in an ideal world, but could they differ in the real world?
> 
> Yes, they could differ if a pure ACK is lost between the client and the server,
> but in my opinion what matters is not to account precisely for the number of ACKs
> so that we wake up exactly after XXX ACKs received, but that in most common
> situations we avoid waking up too early.
> 

We are basically saying the same thing, but you misunderstood me. I was concerned
about a lost (server -> client) SYN-ACK, not a lost (client -> server) ACK, which is
fine (even without playing with TCP_DEFER_ACCEPT at all).

In this case, if we do the retrans test, we'll accept the first (client -> server) ACK
and wake up the application, while most probably we'll receive the client request
a few milliseconds later.

> Also, keep in mind that the TCP_DEFER_ACCEPT parameter is passed as a number
> of seconds by the application, which is in turn converted to a number of
> retransmits based on our own timer, which means that our SYN-ACK counter
> is what most closely matches the application's expected delay, even if an
> ACK from the client gets lost in between or if a client's stack retransmits
> pure ACKs very fast for any implementation-specific reason.
> 

Well, this is why converting the application delay (the sockopt() argument, in
seconds) to a SYN-ACK retransmit count is suboptimal and error prone.

This might be changed to map to what the documentation states: a number of seconds,
or even better a number of milliseconds (a new TCP_DEFER_ACCEPT_MS setsockopt cmd),
because a high-performance server won't play with values above 1 second anyway.
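
For reference, the userspace side today only speaks in whole seconds; a minimal
usage sketch (TCP_DEFER_ACCEPT is the existing option, the TCP_DEFER_ACCEPT_MS
variant above is only a suggestion and does not exist at this point):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Ask the kernel to defer accept() wakeups for roughly "seconds";
 * internally this is mapped to a SYN-ACK retransmit budget. */
static int set_defer_accept(int listen_fd, int seconds)
{
	return setsockopt(listen_fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
			  &seconds, sizeof(seconds));
}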
Willy Tarreau Oct. 16, 2009, 7:19 a.m. UTC | #5
On Fri, Oct 16, 2009 at 09:08:59AM +0200, Eric Dumazet wrote:
(...)
> > Yes, they could differ if a pure ACK is lost between the client and the server,
> > but in my opinion what matters is not to account precisely for the number of ACKs
> > so that we wake up exactly after XXX ACKs received, but that in most common
> > situations we avoid waking up too early.
> > 
> 
> We are basically saying the same thing, but you misunderstood me. I was concerned
> about a lost (server -> client) SYN-ACK, not a lost (client -> server) ACK, which is
> fine (even without playing with TCP_DEFER_ACCEPT at all).
> 
> In this case, if we do the retrans test, we'll accept the first (client -> server) ACK
> and wake up the application, while most probably we'll receive the client request
> a few milliseconds later.

OK, I get your point. We can detect that, though, as Julian explained, with
the ->acked field: it indicates we got an ACK, which proves the SYN-ACK was
received. At first glance, I think that Julian's algorithm, explained at the
end of his mail, covers all cases exactly without using any additional field,
though this is not an issue anyway.

> > Also, keep in mind that the TCP_DEFER_ACCEPT parameter is passed as a number
> > of seconds by the application, which is in turn converted to a number of
> > retransmits based on our own timer, which means that our SYN-ACK counter
> > is what most closely matches the application's expected delay, even if an
> > ACK from the client gets lost in between or if a client's stack retransmits
> > pure ACKs very fast for any implementation-specific reason.
> > 
> 
> Well, this is why converting the application delay (the sockopt() argument, in
> seconds) to a SYN-ACK retransmit count is suboptimal and error prone.

I agree, but it allows the application to be unaware of retransmit timers.

> This might be changed to map to what the documentation states: a number of seconds,
> or even better a number of milliseconds (a new TCP_DEFER_ACCEPT_MS setsockopt cmd),
> because a high-performance server won't play with values above 1 second anyway.

It would be nice, but it would require a new timer. The current implementation
does not need any and is efficient enough for most common cases. In fact, it
would have been better to simply be able to specify that we want to skip one
empty ACK (or X empty ACKs). But let's make use of what we currently have;
with your (or Julian's) changes, it should cover almost all usages without
changing semantics for applications.

Regards,
Willy


Patch

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index c719084..3464d90 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -45,9 +45,12 @@  struct request_sock_ops {
  */
 struct request_sock {
 	struct request_sock		*dl_next; /* Must be first member! */
-	u16				mss;
-	u8				retrans;
-	u8				cookie_ts; /* syncookie: encode tcpopts in timestamp */
+	kmemcheck_bitfield_begin(flags);
+	unsigned int			mss : 16,
+					retrans : 8,
+					req_acks : 7,
+					cookie_ts : 1; /* syncookie: encode tcpopts in timestamp */
+	kmemcheck_bitfield_end(flags);
 	/* The following two fields can be easily recomputed I think -AK */
 	u32				window_clamp; /* window clamp at creation time */
 	u32				rcv_wnd;	  /* rcv_wnd offered first time */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..a2c0439 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -453,7 +453,7 @@  extern __u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS];
 extern struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb, 
 				    struct ip_options *opt);
 extern __u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, 
-				     __u16 *mss);
+				     struct request_sock *req);
 
 extern __u32 cookie_init_timestamp(struct request_sock *req);
 extern void cookie_check_timestamp(struct tcp_options_received *tcp_opt);
@@ -461,7 +461,7 @@  extern void cookie_check_timestamp(struct tcp_options_received *tcp_opt);
 /* From net/ipv6/syncookies.c */
 extern struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb);
 extern __u32 cookie_v6_init_sequence(struct sock *sk, struct sk_buff *skb,
-				     __u16 *mss);
+				     struct request_sock *req);
 
 /* tcp_output.c */
 
@@ -1000,7 +1000,9 @@  static inline void tcp_openreq_init(struct request_sock *req,
 	struct inet_request_sock *ireq = inet_rsk(req);
 
 	req->rcv_wnd = 0;		/* So that tcp_send_synack() knows! */
+	kmemcheck_annotate_bitfield(req, flags);
 	req->cookie_ts = 0;
+	req->req_acks = 0;
 	tcp_rsk(req)->rcv_isn = TCP_SKB_CB(skb)->seq;
 	req->mss = rx_opt->mss_clamp;
 	req->ts_recent = rx_opt->saw_tstamp ? rx_opt->rcv_tsval : 0;
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 5ec678a..8af9261 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -160,19 +160,20 @@  static __u16 const msstab[] = {
  * Generate a syncookie.  mssp points to the mss, which is returned
  * rounded down to the value encoded in the cookie.
  */
-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
+__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb,
+			      struct request_sock *req)
 {
 	const struct iphdr *iph = ip_hdr(skb);
 	const struct tcphdr *th = tcp_hdr(skb);
 	int mssind;
-	const __u16 mss = *mssp;
+	const __u16 mss = req->mss;
 
 	tcp_synq_overflow(sk);
 
 	/* XXX sort msstab[] by probability?  Binary search? */
 	for (mssind = 0; mss > msstab[mssind + 1]; mssind++)
 		;
-	*mssp = msstab[mssind] + 1;
+	req->mss = msstab[mssind] + 1;
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESSENT);
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9971870..3a2dfc5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1286,7 +1286,7 @@  int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		syn_flood_warning(skb);
 		req->cookie_ts = tmp_opt.tstamp_ok;
 #endif
-		isn = cookie_v4_init_sequence(sk, skb, &req->mss);
+		isn = cookie_v4_init_sequence(sk, skb, req);
 	} else if (!isn) {
 		struct inet_peer *peer = NULL;
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index e320afe..92778e6 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -642,9 +642,9 @@  struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 		return NULL;
 
 	/* If TCP_DEFER_ACCEPT is set, drop bare ACK. */
-	if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
+	if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept > req->req_acks &&
 	    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
-		inet_csk(sk)->icsk_accept_queue.rskq_defer_accept--;
+		req->req_acks++;
 		inet_rsk(req)->acked = 1;
 		return NULL;
 	}
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index cbe55e5..60d024d 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -125,18 +125,18 @@  static __u32 check_tcp_syn_cookie(__u32 cookie, struct in6_addr *saddr,
 		& COOKIEMASK;
 }
 
-__u32 cookie_v6_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
+__u32 cookie_v6_init_sequence(struct sock *sk, struct sk_buff *skb, struct request_sock *req)
 {
 	struct ipv6hdr *iph = ipv6_hdr(skb);
 	const struct tcphdr *th = tcp_hdr(skb);
 	int mssind;
-	const __u16 mss = *mssp;
+	const __u16 mss = req->mss;
 
 	tcp_synq_overflow(sk);
 
 	for (mssind = 0; mss > msstab[mssind + 1]; mssind++)
 		;
-	*mssp = msstab[mssind] + 1;
+	req->mss = msstab[mssind] + 1;
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESSENT);
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4517630..2717bcb 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1219,7 +1219,7 @@  static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 		TCP_ECN_create_request(req, tcp_hdr(skb));
 
 	if (want_cookie) {
-		isn = cookie_v6_init_sequence(sk, skb, &req->mss);
+		isn = cookie_v6_init_sequence(sk, skb, req);
 		req->cookie_ts = tmp_opt.tstamp_ok;
 	} else if (!isn) {
 		if (ipv6_opt_accepted(sk, skb) ||