Message ID: 1348735261-29225-1-git-send-email-amwang@redhat.com
State: RFC, archived
Delegated to: David Miller
On Thu, Sep 27, 2012 at 04:41:01PM +0800, Cong Wang wrote:
> Some customer requests this feature, as they stated:
>
> "This parameter is necessary, especially for software that continually creates many ephemeral processes which open sockets, to avoid socket exhaustion. In many cases, the risk of the exhaustion can be reduced by tuning the reuse interval to allow sockets to be reusable earlier.
>
> In commercial Unix systems, this kind of parameter, such as tcp_timewait in AIX and tcp_time_wait_interval in HP-UX, has already been available. Their implementations allow users to tune how long a TCP connection is kept in TIME-WAIT state, on a millisecond time scale."
>
> We do have "tcp_tw_reuse" and "tcp_tw_recycle", but these tunings are not equivalent, in that they cannot be tuned directly on a time scale, nor in a safe way, as some combinations of tunings could still cause problems behind NAT. And I think a second scale is enough; we don't have to make it millisecond scale.
>
I have a little difficulty seeing how this does anything other than pay lip service to actually having sockets spend time in TIME_WAIT state; that is to say, I can see users using this simply to make the pain stop. If we wait less time than it takes to be sure that a connection isn't being reused (either by waiting two segment lifetimes, or by checking timestamps), then you might as well not wait at all. I see how it's tempting to be able to say "just don't wait as long", but there seems to be no difference between waiting half as long as the RFC mandates and waiting no time at all. Neither is a good idea.

Given the problem you're trying to solve here, I'll ask the standard question in response: how does using SO_REUSEADDR not solve the problem? Alternatively, in a pinch, why not reduce tcp_max_tw_buckets sufficiently to start forcing TIME_WAIT sockets back into CLOSED state?
The code looks fine, but the idea really doesn't seem like a good plan to me. I'm sure HPUX/Solaris/AIX/etc have done this in response to customer demand, but that doesn't make it the right solution.

Regards
Neil

> See also: https://lkml.org/lkml/2008/11/15/80
>
> Any comments?
>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Signed-off-by: Cong Wang <amwang@redhat.com>
>
> ---
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index c7fc107..4b24398 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -520,6 +520,12 @@ tcp_tw_reuse - BOOLEAN
>  	It should not be changed without advice/request of technical
>  	experts.
>
> +tcp_tw_interval - INTEGER
> +	Specify the timeout, in seconds, of TIME-WAIT sockets.
> +	It should not be changed without advice/request of technical
> +	experts.
> +	Default: 60
> +
>  tcp_window_scaling - BOOLEAN
>  	Enable window scaling as defined in RFC1323.
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 6feeccd..72f92a1 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -114,9 +114,10 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
>   * initial RTO.
>   */
>
> -#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
> -				  * state, about 60 seconds */
> -#define TCP_FIN_TIMEOUT	TCP_TIMEWAIT_LEN
> +#define TCP_TIMEWAIT_LEN (sysctl_tcp_tw_interval * HZ)
> +				 /* how long to wait to destroy TIME-WAIT
> +				  * state, default 60 seconds */
> +#define TCP_FIN_TIMEOUT	(60*HZ)
>  				 /* BSD style FIN_WAIT2 deadlock breaker.
>  				  * It used to be 3min, new value is 60sec,
>  				  * to combine FIN-WAIT-2 timeout with
> @@ -292,6 +293,7 @@ extern int sysctl_tcp_thin_dupack;
>  extern int sysctl_tcp_early_retrans;
>  extern int sysctl_tcp_limit_output_bytes;
>  extern int sysctl_tcp_challenge_ack_limit;
> +extern int sysctl_tcp_tw_interval;
>
>  extern atomic_long_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 9205e49..f99cacf 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -27,6 +27,7 @@
>  #include <net/tcp_memcontrol.h>
>
>  static int zero;
> +static int one = 1;
>  static int two = 2;
>  static int tcp_retr1_max = 255;
>  static int ip_local_port_range_min[] = { 1, 1 };
> @@ -271,6 +272,28 @@ bad_key:
>  	return ret;
>  }
>
> +static int proc_tcp_tw_interval(ctl_table *ctl, int write,
> +				void __user *buffer, size_t *lenp,
> +				loff_t *ppos)
> +{
> +	int ret;
> +	ctl_table tmp = {
> +		.data = &sysctl_tcp_tw_interval,
> +		.maxlen = sizeof(int),
> +		.mode = ctl->mode,
> +		.extra1 = &one,
> +	};
> +
> +	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> +	if (ret)
> +		return ret;
> +	if (write)
> +		tcp_death_row.period = (HZ / INET_TWDR_TWKILL_SLOTS)
> +					* sysctl_tcp_tw_interval;
> +
> +	return 0;
> +}
> +
>  static struct ctl_table ipv4_table[] = {
>  	{
>  		.procname = "tcp_timestamps",
> @@ -794,6 +817,13 @@ static struct ctl_table ipv4_table[] = {
>  		.proc_handler = proc_dointvec_minmax,
>  		.extra1 = &zero
>  	},
> +	{
> +		.procname = "tcp_tw_interval",
> +		.data = &sysctl_tcp_tw_interval,
> +		.maxlen = sizeof(int),
> +		.mode = 0644,
> +		.proc_handler = proc_tcp_tw_interval,
> +	},
>  	{ }
>  };
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 93406c5..64af0b6 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -86,6 +86,7 @@
>  #include <linux/scatterlist.h>
>
>  int sysctl_tcp_tw_reuse __read_mostly;
> +int sysctl_tcp_tw_interval __read_mostly = 60;
>  int sysctl_tcp_low_latency __read_mostly;
>  EXPORT_SYMBOL(sysctl_tcp_low_latency);
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 27536ba..e16f524 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -34,7 +34,7 @@ int sysctl_tcp_abort_on_overflow __read_mostly;
>
>  struct inet_timewait_death_row tcp_death_row = {
>  	.sysctl_max_tw_buckets = NR_FILE * 2,
> -	.period		= TCP_TIMEWAIT_LEN / INET_TWDR_TWKILL_SLOTS,
> +	.period		= (60 * HZ) / INET_TWDR_TWKILL_SLOTS,
>  	.death_lock	= __SPIN_LOCK_UNLOCKED(tcp_death_row.death_lock),
>  	.hashinfo	= &tcp_hashinfo,
>  	.tw_timer	= TIMER_INITIALIZER(inet_twdr_hangman, 0,
--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/27/2012 07:23 AM, Neil Horman wrote:
> The code looks fine, but the idea really doesn't seem like a good plan to me. I'm sure HPUX/Solaris/AIX/etc have done this in response to customer demand, but that doesn't make it the right solution.

In the case of HP-UX at least, while the rope is indeed there, the advice is to not wrap it around one's neck unless one *really* has a handle on the environment. Instead, the things suggested are, in no particular order:

*) The aforementioned SO_REUSEADDR, to address the "I can't restart the server quickly enough" issue.
*) Tuning the size of the anonymous/ephemeral port range.
*) Making explicit bind() calls using the entire non-privileged port range.
*) Making the connections longer-lived, especially if the comms are between a fixed set of IP addresses.

rick jones
From: Cong Wang <amwang@redhat.com>
Date: Thu, 27 Sep 2012 16:41:01 +0800

> In commercial Unix systems, this kind of parameters, such as tcp_timewait in AIX and tcp_time_wait_interval in HP-UX, have already been available. Their implementations allow users to tune how long they keep TCP connection as TIME-WAIT state on the millisecond time scale."

This statement only makes me happy that these systems are not as widely deployed as Linux is.

Furthermore, the mere existence of a facility in another system is never an argument for why we should have it too. Often it's instead a huge reason for us not to add it.

Without appropriate confirmation that an early time-wait reuse is valid, decreasing this interval can only be dangerous.
On Thu, 2012-09-27 at 10:23 -0400, Neil Horman wrote:
> On Thu, Sep 27, 2012 at 04:41:01PM +0800, Cong Wang wrote:
> > Some customer requests this feature, as they stated:
> >
> > "This parameter is necessary, especially for software that continually creates many ephemeral processes which open sockets, to avoid socket exhaustion. In many cases, the risk of the exhaustion can be reduced by tuning reuse interval to allow sockets to be reusable earlier.
> >
> > In commercial Unix systems, this kind of parameters, such as tcp_timewait in AIX and tcp_time_wait_interval in HP-UX, have already been available. Their implementations allow users to tune how long they keep TCP connection as TIME-WAIT state on the millisecond time scale."
> >
> > We indeed have "tcp_tw_reuse" and "tcp_tw_recycle", but these tunings are not equivalent in that they cannot be tuned directly on the time scale nor in a safe way, as some combinations of tunings could still cause some problem in NAT. And, I think second scale is enough, we don't have to make it in millisecond time scale.
> >
> I think I have a little difficulty seeing how this does anything other than pay lip service to actually having sockets spend time in TIME_WAIT state. That is to say, I see users using this to just make the pain stop. If we wait less time than it takes to be sure that a connection isn't being reused (either by waiting two segment lifetimes, or by checking timestamps), then you might as well not wait at all. I see how it's tempting to be able to say "Just don't wait as long", but it seems that there's no difference between waiting half as long as the RFC mandates, and waiting no time at all. Neither is a good idea.

I don't think reducing TIME_WAIT is a good idea either, but there must be some reason behind it, as several UNIXes provide a microsecond-scale tuning interface; or maybe in non-recycle mode their RTO is much less than 2*MSL?

> Given the problem you're trying to solve here, I'll ask the standard question in response: How does using SO_REUSEADDR not solve the problem? Alternatively, in a pinch, why not reduce the tcp_max_tw_buckets sufficiently to start forcing TIME_WAIT sockets back into CLOSED state?
>
> The code looks fine, but the idea really doesn't seem like a good plan to me. I'm sure HPUX/Solaris/AIX/etc have done this in response to customer demand, but that doesn't make it the right solution.

*I think* the customer doesn't want to modify their applications, so that is why they don't use SO_REUSEADDR.

I didn't know tcp_max_tw_buckets could do the trick, nor did the customer. So this is a side effect of tcp_max_tw_buckets? Is it documented?

Thanks.
On Thu, 2012-09-27 at 13:05 -0400, David Miller wrote:
>
> Without appropriate confirmation that an early time-wait reuse is valid, decreasing this interval can only be dangerous.

Yeah, would proper documentation cure this? Something like we did for other tunings:

"It should not be changed without advice/request of technical experts."
From: Cong Wang <amwang@redhat.com>
Date: Fri, 28 Sep 2012 14:33:07 +0800

> I don't think reducing TIME_WAIT is a good idea either, but there must be some reason behind it, as several UNIXes provide a microsecond-scale tuning interface; or maybe in non-recycle mode their RTO is much less than 2*MSL?

Yes, there is a reason. It's there for retaining multi-million-dollar customers.

There are no other reasons these other systems provide these facilities; they are simply there in an attempt to retain a dwindling customer base.

Any other belief is extremely naive.
From: Cong Wang <amwang@redhat.com>
Date: Fri, 28 Sep 2012 14:39:59 +0800

> On Thu, 2012-09-27 at 13:05 -0400, David Miller wrote:
> >
> > Without appropriate confirmation that an early time-wait reuse is valid, decreasing this interval can only be dangerous.
>
> Yeah, would a proper documentation cure this? Something like we did for other tuning:
>
> "It should not be changed without advice/request of technical experts."

No, we're not adding this facility.
On Fri, Sep 28, 2012 at 02:33:07PM +0800, Cong Wang wrote:
> On Thu, 2012-09-27 at 10:23 -0400, Neil Horman wrote:
> > On Thu, Sep 27, 2012 at 04:41:01PM +0800, Cong Wang wrote:
> > > Some customer requests this feature, as they stated:
> > >
> > > "This parameter is necessary, especially for software that continually creates many ephemeral processes which open sockets, to avoid socket exhaustion. In many cases, the risk of the exhaustion can be reduced by tuning reuse interval to allow sockets to be reusable earlier.
> > >
> > > In commercial Unix systems, this kind of parameters, such as tcp_timewait in AIX and tcp_time_wait_interval in HP-UX, have already been available. Their implementations allow users to tune how long they keep TCP connection as TIME-WAIT state on the millisecond time scale."
> > >
> > > We indeed have "tcp_tw_reuse" and "tcp_tw_recycle", but these tunings are not equivalent in that they cannot be tuned directly on the time scale nor in a safe way, as some combinations of tunings could still cause some problem in NAT. And, I think second scale is enough, we don't have to make it in millisecond time scale.
> > >
> > I think I have a little difficulty seeing how this does anything other than pay lip service to actually having sockets spend time in TIME_WAIT state. That is to say, I see users using this to just make the pain stop. If we wait less time than it takes to be sure that a connection isn't being reused (either by waiting two segment lifetimes, or by checking timestamps), then you might as well not wait at all. I see how it's tempting to be able to say "Just don't wait as long", but it seems that there's no difference between waiting half as long as the RFC mandates, and waiting no time at all. Neither is a good idea.
>
> I don't think reducing TIME_WAIT is a good idea either, but there must be some reason behind it, as several UNIXes provide a microsecond-scale tuning interface; or maybe in non-recycle mode their RTO is much less than 2*MSL?
>
My guess? Cash was the reason. I certainly wasn't there for any of those developments, but a setting like this just smells to me like some customer waved some cash under IBM's/HP's/Sun's nose and said, "We'd like to get our TCP sockets back to CLOSED state faster, what can you do for us?"

> > Given the problem you're trying to solve here, I'll ask the standard question in response: How does using SO_REUSEADDR not solve the problem? Alternatively, in a pinch, why not reduce the tcp_max_tw_buckets sufficiently to start forcing TIME_WAIT sockets back into CLOSED state?
> >
> > The code looks fine, but the idea really doesn't seem like a good plan to me. I'm sure HPUX/Solaris/AIX/etc have done this in response to customer demand, but that doesn't make it the right solution.
>
> *I think* the customer doesn't want to modify their applications, so that is why they don't use SO_REUSEADDR.
>
Well, OK, that's a legitimate distro problem. What it's not is an upstream problem. Fixing the application is the right thing to do, whether or not they want to.

> I didn't know tcp_max_tw_buckets can do the trick, nor the customer, so this is a side effect of tcp_max_tw_buckets? Is it documented?

man 7 tcp:

	tcp_max_tw_buckets (integer; default: see below; since Linux 2.4)
		The maximum number of sockets in TIME_WAIT state allowed in
		the system. This limit exists only to prevent simple
		denial-of-service attacks. The default value of NR_FILE*2 is
		adjusted depending on the memory in the system. If this
		number is exceeded, the socket is closed and a warning is
		printed.

Neil
On 09/27/2012 11:43 PM, David Miller wrote:
> From: Cong Wang <amwang@redhat.com>
> Date: Fri, 28 Sep 2012 14:33:07 +0800
>
> > I don't think reducing TIME_WAIT is a good idea either, but there must be some reason behind it, as several UNIXes provide a microsecond-scale tuning interface; or maybe in non-recycle mode their RTO is much less than 2*MSL?

Microsecond? HP-UX uses milliseconds for the units of the tunable, though that does not necessarily mean it is actually implemented to millisecond accuracy.

> Yes, there is a reason. It's there for retaining multi-million-dollar customers.
>
> There is no other reasons these other systems provide these facilities, they are simply there in an attempt to retain a dwindling customer base.
>
> Any other belief is extremely naive.

HP-UX's TIME_WAIT interval tunability goes back to HP-UX 11.0, which first shipped in 1997. It got it by virtue of using a "Mentat-based" stack which had that functionality. I may not have my history completely correct, but Solaris 2 also got its networking bits from Mentat, and I believe it shipped before HP-UX 11. To my recollection, neither was faced with a dwindling customer base at the time.

rick jones
On Fri, 2012-09-28 at 09:16 -0400, Neil Horman wrote:
> On Fri, Sep 28, 2012 at 02:33:07PM +0800, Cong Wang wrote:
> > On Thu, 2012-09-27 at 10:23 -0400, Neil Horman wrote:
> > > On Thu, Sep 27, 2012 at 04:41:01PM +0800, Cong Wang wrote:
> > > > Some customer requests this feature, as they stated:
> > > >
> > > > "This parameter is necessary, especially for software that continually creates many ephemeral processes which open sockets, to avoid socket exhaustion. In many cases, the risk of the exhaustion can be reduced by tuning reuse interval to allow sockets to be reusable earlier.
> > > >
> > > > In commercial Unix systems, this kind of parameters, such as tcp_timewait in AIX and tcp_time_wait_interval in HP-UX, have already been available. Their implementations allow users to tune how long they keep TCP connection as TIME-WAIT state on the millisecond time scale."
> > > >
> > > > We indeed have "tcp_tw_reuse" and "tcp_tw_recycle", but these tunings are not equivalent in that they cannot be tuned directly on the time scale nor in a safe way, as some combinations of tunings could still cause some problem in NAT. And, I think second scale is enough, we don't have to make it in millisecond time scale.
> > > >
> > > I think I have a little difficulty seeing how this does anything other than pay lip service to actually having sockets spend time in TIME_WAIT state. That is to say, I see users using this to just make the pain stop. If we wait less time than it takes to be sure that a connection isn't being reused (either by waiting two segment lifetimes, or by checking timestamps), then you might as well not wait at all. I see how it's tempting to be able to say "Just don't wait as long", but it seems that there's no difference between waiting half as long as the RFC mandates, and waiting no time at all. Neither is a good idea.
> >
> > I don't think reducing TIME_WAIT is a good idea either, but there must be some reason behind it, as several UNIXes provide a microsecond-scale tuning interface; or maybe in non-recycle mode their RTO is much less than 2*MSL?
> >
> My guess? Cash was the reason. I certainly wasn't there for any of those developments, but a setting like this just smells to me like some customer waved some cash under IBM's/HP's/Sun's nose and said, "We'd like to get our TCP sockets back to CLOSED state faster, what can you do for us?"

Yeah, maybe. But wouldn't it still make sense if they are sure their packets cannot linger in their high-speed LAN for 2*MSL?

> > > Given the problem you're trying to solve here, I'll ask the standard question in response: How does using SO_REUSEADDR not solve the problem? Alternatively, in a pinch, why not reduce the tcp_max_tw_buckets sufficiently to start forcing TIME_WAIT sockets back into CLOSED state?
> > >
> > > The code looks fine, but the idea really doesn't seem like a good plan to me. I'm sure HPUX/Solaris/AIX/etc have done this in response to customer demand, but that doesn't make it the right solution.
> >
> > *I think* the customer doesn't want to modify their applications, so that is why they don't use SO_REUSEADDR.
> >
> Well, OK, that's a legitimate distro problem. What it's not is an upstream problem. Fixing the application is the right thing to do, whether or not they want to.
>
> > I didn't know tcp_max_tw_buckets can do the trick, nor the customer, so this is a side effect of tcp_max_tw_buckets? Is it documented?
>
> man 7 tcp:
>
>	tcp_max_tw_buckets (integer; default: see below; since Linux 2.4)
>		The maximum number of sockets in TIME_WAIT state allowed in
>		the system. This limit exists only to prevent simple
>		denial-of-service attacks. The default value of NR_FILE*2 is
>		adjusted depending on the memory in the system. If this
>		number is exceeded, the socket is closed and a warning is
>		printed.

Hey, "a warning is printed" seems not very friendly. ;)

Thanks!
On Tue, Oct 02, 2012 at 03:04:39PM +0800, Cong Wang wrote:
> On Fri, 2012-09-28 at 09:16 -0400, Neil Horman wrote:
> > On Fri, Sep 28, 2012 at 02:33:07PM +0800, Cong Wang wrote:
> > > On Thu, 2012-09-27 at 10:23 -0400, Neil Horman wrote:
> > > > On Thu, Sep 27, 2012 at 04:41:01PM +0800, Cong Wang wrote:
> > > > > Some customer requests this feature, as they stated:
> > > > >
> > > > > "This parameter is necessary, especially for software that continually creates many ephemeral processes which open sockets, to avoid socket exhaustion. In many cases, the risk of the exhaustion can be reduced by tuning reuse interval to allow sockets to be reusable earlier.
> > > > >
> > > > > In commercial Unix systems, this kind of parameters, such as tcp_timewait in AIX and tcp_time_wait_interval in HP-UX, have already been available. Their implementations allow users to tune how long they keep TCP connection as TIME-WAIT state on the millisecond time scale."
> > > > >
> > > > > We indeed have "tcp_tw_reuse" and "tcp_tw_recycle", but these tunings are not equivalent in that they cannot be tuned directly on the time scale nor in a safe way, as some combinations of tunings could still cause some problem in NAT. And, I think second scale is enough, we don't have to make it in millisecond time scale.
> > > > >
> > > > I think I have a little difficulty seeing how this does anything other than pay lip service to actually having sockets spend time in TIME_WAIT state. That is to say, I see users using this to just make the pain stop. If we wait less time than it takes to be sure that a connection isn't being reused (either by waiting two segment lifetimes, or by checking timestamps), then you might as well not wait at all. I see how it's tempting to be able to say "Just don't wait as long", but it seems that there's no difference between waiting half as long as the RFC mandates, and waiting no time at all. Neither is a good idea.
> > >
> > > I don't think reducing TIME_WAIT is a good idea either, but there must be some reason behind it, as several UNIXes provide a microsecond-scale tuning interface; or maybe in non-recycle mode their RTO is much less than 2*MSL?
> > >
> > My guess? Cash was the reason. I certainly wasn't there for any of those developments, but a setting like this just smells to me like some customer waved some cash under IBM's/HP's/Sun's nose and said, "We'd like to get our TCP sockets back to CLOSED state faster, what can you do for us?"
>
> Yeah, maybe. But wouldn't it still make sense if they are sure their packets cannot linger in their high-speed LAN for 2*MSL?
>
No, it doesn't make sense, but the universal rule is that the business people will focus more on revenue recognition than on sound design practice.

> > > > Given the problem you're trying to solve here, I'll ask the standard question in response: How does using SO_REUSEADDR not solve the problem? Alternatively, in a pinch, why not reduce the tcp_max_tw_buckets sufficiently to start forcing TIME_WAIT sockets back into CLOSED state?
> > > >
> > > > The code looks fine, but the idea really doesn't seem like a good plan to me. I'm sure HPUX/Solaris/AIX/etc have done this in response to customer demand, but that doesn't make it the right solution.
> > >
> > > *I think* the customer doesn't want to modify their applications, so that is why they don't use SO_REUSEADDR.
> > >
> > Well, OK, that's a legitimate distro problem. What it's not is an upstream problem. Fixing the application is the right thing to do, whether or not they want to.
>
> > > I didn't know tcp_max_tw_buckets can do the trick, nor the customer, so this is a side effect of tcp_max_tw_buckets? Is it documented?
> > man 7 tcp:
> >	tcp_max_tw_buckets (integer; default: see below; since Linux 2.4)
> >		The maximum number of sockets in TIME_WAIT state allowed in
> >		the system. This limit exists only to prevent simple
> >		denial-of-service attacks. The default value of NR_FILE*2 is
> >		adjusted depending on the memory in the system. If this
> >		number is exceeded, the socket is closed and a warning is
> >		printed.
>
> Hey, "a warning is printed" seems not very friendly. ;)
>
No, it's not very friendly, but the people using this are violating the RFC, which isn't very friendly. :)

> Thanks!
On Tue, 2012-10-02 at 08:09 -0400, Neil Horman wrote:
> No, its not very friendly, but the people using this are violating the RFC, which isn't very friendly. :)

Could you be more specific? In RFC 793, AFAIK, it is allowed to be changed:

http://tools.ietf.org/html/rfc793

"To be sure that a TCP does not create a segment that carries a sequence number which may be duplicated by an old segment remaining in the network, the TCP must keep quiet for a maximum segment lifetime (MSL) before assigning any sequence numbers upon starting up or recovering from a crash in which memory of sequence numbers in use was lost. For this specification the MSL is taken to be 2 minutes. This is an engineering choice, and may be changed if experience indicates it is desirable to do so."

or I must still be missing something here... :)
On Mon, Oct 08, 2012 at 11:17:37AM +0800, Cong Wang wrote:
> On Tue, 2012-10-02 at 08:09 -0400, Neil Horman wrote:
> > No, its not very friendly, but the people using this are violating the RFC, which isn't very friendly. :)
>
> Could you be more specific? In RFC 793, AFAIK, it is allowed to be changed:
>
> http://tools.ietf.org/html/rfc793
>
> "To be sure that a TCP does not create a segment that carries a sequence number which may be duplicated by an old segment remaining in the network, the TCP must keep quiet for a maximum segment lifetime (MSL) before assigning any sequence numbers upon starting up or recovering from a crash in which memory of sequence numbers in use was lost. For this specification the MSL is taken to be 2 minutes. This is an engineering choice, and may be changed if experience indicates it is desirable to do so."
>
It's the length of time that represents an MSL that was the choice, not the fact that reusing a connection before the expiration of the MSL is a bad idea.

> or I must still be missing something here... :)
>
Next paragraph down:

	This specification provides that hosts which "crash" without
	retaining any knowledge of the last sequence numbers transmitted on
	each active (i.e., not closed) connection shall delay emitting any
	TCP segments for at least the agreed Maximum Segment Lifetime (MSL)
	in the internet system of which the host is a part. In the
	paragraphs below, an explanation for this specification is given.
	TCP implementors may violate the "quiet time" restriction, but only
	at the risk of causing some old data to be accepted as new or new
	data rejected as old duplicated by some receivers in the internet
	system.

.... etc.
On Mon, 2012-10-08 at 10:07 -0400, Neil Horman wrote: > On Mon, Oct 08, 2012 at 11:17:37AM +0800, Cong Wang wrote: > > On Tue, 2012-10-02 at 08:09 -0400, Neil Horman wrote: > > > No, its not very friendly, but the people using this are violating the RFC, > > > which isn't very friendly. :) > > > > Could you be more specific? In RFC 793, AFAIK, it is allowed to be > > changed: > > > > http://tools.ietf.org/html/rfc793 > > > > " To be sure that a TCP does not create a segment that carries a > > sequence number which may be duplicated by an old segment remaining in > > the network, the TCP must keep quiet for a maximum segment lifetime > > (MSL) before assigning any sequence numbers upon starting up or > > recovering from a crash in which memory of sequence numbers in use was > > lost. For this specification the MSL is taken to be 2 minutes. This > > is an engineering choice, and may be changed if experience indicates > > it is desirable to do so." > > > Its the length of time that represents an MSL that was the choice, not the fact > that reusing a TCP before the expiration of the MSL is a bad idea. > > > or I must still be missing something here... :) > > > Next paragraph down: > This specification provides that hosts which "crash" without > retaining any knowledge of the last sequence numbers transmitted on > each active (i.e., not closed) connection shall delay emitting any > TCP segments for at least the agreed Maximum Segment Lifetime (MSL) > in the internet system of which the host is a part. In the > paragraphs below, an explanation for this specification is given. > TCP implementors may violate the "quiet time" restriction, but only > at the risk of causing some old data to be accepted as new or new > data rejected as old duplicated by some receivers in the internet > system. .... etc. > > Ah, ok. Thanks for the detailed answer! 
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index c7fc107..4b24398 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -520,6 +520,12 @@ tcp_tw_reuse - BOOLEAN
 	It should not be changed without advice/request of technical
 	experts.
 
+tcp_tw_interval - INTEGER
+	Specify the timeout, in seconds, of TIME-WAIT sockets.
+	It should not be changed without advice/request of technical
+	experts.
+	Default: 60
+
 tcp_window_scaling - BOOLEAN
 	Enable window scaling as defined in RFC1323.
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6feeccd..72f92a1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -114,9 +114,10 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 				  * initial RTO.
 				  */
 
-#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
-				  * state, about 60 seconds	*/
-#define TCP_FIN_TIMEOUT	TCP_TIMEWAIT_LEN
+#define TCP_TIMEWAIT_LEN (sysctl_tcp_tw_interval * HZ)
+				 /* how long to wait to destroy TIME-WAIT
+				  * state, default 60 seconds	*/
+#define TCP_FIN_TIMEOUT	(60*HZ)
 				 /* BSD style FIN_WAIT2 deadlock breaker.
 				  * It used to be 3min, new value is 60sec,
 				  * to combine FIN-WAIT-2 timeout with
@@ -292,6 +293,7 @@ extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
+extern int sysctl_tcp_tw_interval;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 9205e49..f99cacf 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -27,6 +27,7 @@
 #include <net/tcp_memcontrol.h>
 
 static int zero;
+static int one = 1;
 static int two = 2;
 static int tcp_retr1_max = 255;
 static int ip_local_port_range_min[] = { 1, 1 };
@@ -271,6 +272,28 @@ bad_key:
 	return ret;
 }
 
+static int proc_tcp_tw_interval(ctl_table *ctl, int write,
+				void __user *buffer, size_t *lenp,
+				loff_t *ppos)
+{
+	int ret;
+	ctl_table tmp = {
+		.data = &sysctl_tcp_tw_interval,
+		.maxlen = sizeof(int),
+		.mode = ctl->mode,
+		.extra1 = &one,
+	};
+
+	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;
+	if (write)
+		tcp_death_row.period = (HZ / INET_TWDR_TWKILL_SLOTS)
+					* sysctl_tcp_tw_interval;
+
+	return 0;
+}
+
 static struct ctl_table ipv4_table[] = {
 	{
 		.procname	= "tcp_timestamps",
@@ -794,6 +817,13 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero
 	},
+	{
+		.procname	= "tcp_tw_interval",
+		.data		= &sysctl_tcp_tw_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_tcp_tw_interval,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 93406c5..64af0b6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -86,6 +86,7 @@
 #include <linux/scatterlist.h>
 
 int sysctl_tcp_tw_reuse __read_mostly;
+int sysctl_tcp_tw_interval __read_mostly = 60;
 int sysctl_tcp_low_latency __read_mostly;
 EXPORT_SYMBOL(sysctl_tcp_low_latency);
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 27536ba..e16f524 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -34,7 +34,7 @@ int sysctl_tcp_abort_on_overflow __read_mostly;
 
 struct inet_timewait_death_row tcp_death_row = {
 	.sysctl_max_tw_buckets = NR_FILE * 2,
-	.period		= TCP_TIMEWAIT_LEN / INET_TWDR_TWKILL_SLOTS,
+	.period		= (60 * HZ) / INET_TWDR_TWKILL_SLOTS,
 	.death_lock	= __SPIN_LOCK_UNLOCKED(tcp_death_row.death_lock),
 	.hashinfo	= &tcp_hashinfo,
 	.tw_timer	= TIMER_INITIALIZER(inet_twdr_hangman, 0,
Some customer requests this feature, as they stated:

"This parameter is necessary, especially for software that continually
creates many ephemeral processes which open sockets, to avoid socket
exhaustion. In many cases, the risk of the exhaustion can be reduced by
tuning reuse interval to allow sockets to be reusable earlier.

In commercial Unix systems, this kind of parameters, such as
tcp_timewait in AIX and tcp_time_wait_interval in HP-UX, have
already been available. Their implementations allow users to tune
how long they keep TCP connection as TIME-WAIT state on the
millisecond time scale."

We indeed have "tcp_tw_reuse" and "tcp_tw_recycle", but these tunings
are not equivalent in that they cannot be tuned directly on the time
scale nor in a safe way, as some combinations of tunings could still
cause some problem in NAT. And, I think second scale is enough, we don't
have to make it in millisecond time scale.

See also: https://lkml.org/lkml/2008/11/15/80

Any comments?

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
---