Message ID | 49220D75.1070803@candelatech.com |
---|---|
State | Deferred, archived |
Delegated to: | David Miller |
Ben Greear wrote:
> Ok, here is the patch that implements this.  The idea is to spread out
> arp requests when you do something like start 500 TCP connections on 500
> MAC-VLANs talking to 500 other MAC-VLANs.
>
> With a retrans timer of 1 sec, and a high volume of traffic, and a
> semi flaky network in between, my system will not resolve the ARPs
> and the retransmits overload my processors.
>
> Setting the retrans timer to 5 secs on my system also works, so I'm
> not sure if this patch is really required, but it might help keep arp
> requests somewhat random in cases where arp timers would otherwise
> try to all fire at the same time.
>
> This is against 2.6.25.20 plus my patches, but I believe it should
> apply to a clean 2.6.25.20 as well.
>
> Comments are welcome.
>
> Signed-Off-By  Ben Greear<greearb@candelatech.com>
>
> Thanks,
> Ben
>
> ------------------------------------------------------------------------
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 518ebe6..4c805b3 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -2028,6 +2028,16 @@ Expression of retrans_time, which is deprecated, is in 1/100 seconds (for
>  IPv4) or in jiffies (for IPv6).
>  Expression of retrans_time_ms is in milliseconds.
>
> +
> +retrans_rand_backof_ms
> +----------------------
> +
> +This is an extra delay (ms) for the retransmit timer.  A random value between
> +0 and retrans_rand_backof_ms will be added to the retrans_timer.  Default
> +is zero.  Setting this to a larger value will help large broadcast domains
> +resolve ARP (for instance, 500 mac-vlans talking to 500 other mac-vlans).
> +
> +
>  unres_qlen
>  ----------
> ...
>
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 19b8e00..ec1f048 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -765,6 +765,13 @@ static __inline__ int neigh_max_probes(struct neighbour *n)
>  		p->ucast_probes + p->app_probes + p->mcast_probes);
>  }
>
> +static unsigned long neigh_rand_retry(struct neighbour* neigh) {
> +	if (neigh->parms->retrans_rand_backoff) {
> +		return net_random() % neigh->parms->retrans_rand_backoff;
> +	}
> +	return 0;
> +}
> +
>  /* Called when a timer expires for a neighbour entry. */

I thought that mod was something we tried to avoid?  Could you instead
use something that isn't random but perhaps varies among all the
requests?  Say some of the low-order bits of the IP being resolved?

It wouldn't necessarily be "fair" to some destination IP's but it should
serve to spread things out a bit without having to generate a random
number and mod it.

rick jones
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones wrote:
>> +static unsigned long neigh_rand_retry(struct neighbour* neigh) {
>> +	if (neigh->parms->retrans_rand_backoff) {
>> +		return net_random() % neigh->parms->retrans_rand_backoff;
>> +	}
>> +	return 0;
>> +}
>> +
>>  /* Called when a timer expires for a neighbour entry. */
>
> I thought that mod was something we tried to avoid?  Could you instead
> use something that isn't random but perhaps varies among all the
> requests?  Say some of the low-order bits of the IP being resolved?

This is only called when we are going to retransmit an ARP, which shouldn't
be in any sort of hot path, so I figured MOD was fine.

The net_random is a very cheap method (last I checked), as well.

So, I think that part is OK as it is, but I'm open to persuasion :)

Thanks,
Ben
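For illustration only, the alternative raised above (taking the delay from low-order bits of the address being resolved rather than from a random number) might be sketched in userspace C as follows. The helper name `jitter_from_addr_ms`, the host-byte-order address, and the power-of-two window are assumptions made for this sketch; none of them appear in the patch or the thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): derive a per-destination
 * extra delay from the low-order bits of the IPv4 address being resolved.
 * addr is in host byte order.  window_ms must be a power of two so that
 * the reduction is a bit mask rather than a modulo.
 */
static uint32_t jitter_from_addr_ms(uint32_t addr, uint32_t window_ms)
{
	/* A non-zero power of two has exactly one bit set. */
	assert(window_ms != 0 && (window_ms & (window_ms - 1)) == 0);

	/* Keep only the bits below the window size: result is in [0, window_ms). */
	return addr & (window_ms - 1);
}
```

The result is deterministic per destination, so neighbouring addresses land on different offsets but a given address always gets the same one, which is the "not necessarily fair" trade-off mentioned in the suggestion.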
Ben Greear wrote:
> Rick Jones wrote:
>
>>> +static unsigned long neigh_rand_retry(struct neighbour* neigh) {
>>> +	if (neigh->parms->retrans_rand_backoff) {
>>> +		return net_random() % neigh->parms->retrans_rand_backoff;
>>> +	}
>>> +	return 0;
>>> +}
>>> +
>>>  /* Called when a timer expires for a neighbour entry. */
>>
>> I thought that mod was something we tried to avoid?  Could you instead
>> use something that isn't random but perhaps varies among all the
>> requests?  Say some of the low-order bits of the IP being resolved?
>
> This is only called when we are going to retransmit an ARP, which shouldn't
> be in any sort of hot path, so I figured MOD was fine.
>
> The net_random is a very cheap method (last I checked), as well.
>
> So, I think that part is OK as it is, but I'm open to
> persuasion :)

Perhaps I'm confused, or simply channeling Emily Litella again, but if
you only do this on the 1st through Nth retransmissions (ie after the
first retransmission timer has popped) don't you still have a thundering
herd problem on the first transmission and the first retransmission of
ARP requests?

rick jones
Rick Jones wrote:
> Ben Greear wrote:
>> Rick Jones wrote:
>>
>>>> +static unsigned long neigh_rand_retry(struct neighbour* neigh) {
>>>> +	if (neigh->parms->retrans_rand_backoff) {
>>>> +		return net_random() % neigh->parms->retrans_rand_backoff;
>>>> +	}
>>>> +	return 0;
>>>> +}
>>>> +
>>>>  /* Called when a timer expires for a neighbour entry. */
>>>
>>> I thought that mod was something we tried to avoid?  Could you
>>> instead use something that isn't random but perhaps varies among all
>>> the requests?  Say some of the low-order bits of the IP being resolved?
>>
>> This is only called when we are going to retransmit an ARP, which
>> shouldn't be in any sort of hot path, so I figured MOD was fine.
>>
>> The net_random is a very cheap method (last I checked), as well.
>>
>> So, I think that part is OK as it is, but I'm open to
>> persuasion :)
>
> Perhaps I'm confused, or simply channeling Emily Litella again, but if
> you only do this on the 1st through Nth retransmissions (ie after the
> first retransmission timer has popped) don't you still have a thundering
> herd problem on the first transmission and the first retransmission of
> ARP requests?

You'd certainly have it on the first transmission, but I think from there on
the randomness should kick in.  This is a pretty rare case, and I'd rather
not slow down the initial ARP.  If we *are* in the overload situation, then
the network can just purge/drop/whatever the initial flood and then the
retransmits should start doing their random thing.  On my system, it still
takes maybe 30 seconds for all the ARPs to resolve since a good deal of
the requests and/or responses are being lost.

After some more testing, I can still get it into a bad
state if I have a retrans timer of 1 sec and a randomness of 5 secs
and manage to cause all 1000 arp entries to go stale at once (by
yanking a cable, for instance).

It seems I have to bump up the base timer to 3-5 seconds (I'm
leaving the random backoff at 5 secs as well).

Thanks,
Ben
From: Ben Greear <greearb@candelatech.com>
Date: Mon, 17 Nov 2008 17:50:50 -0800

> Rick Jones wrote:
> > Ben Greear wrote:
> >> Rick Jones wrote:
> >>
> >>>> +static unsigned long neigh_rand_retry(struct neighbour* neigh) {
> >>>> +	if (neigh->parms->retrans_rand_backoff) {
> >>>> +		return net_random() % neigh->parms->retrans_rand_backoff;
> >>>> +	}
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>>  /* Called when a timer expires for a neighbour entry. */
> >>>
> >>> I thought that mod was something we tried to avoid?  Could you
> >>> instead use something that isn't random but perhaps varies among all
> >>> the requests?  Say some of the low-order bits of the IP being resolved?
> >>
> >> This is only called when we are going to retransmit an ARP, which shouldn't
> >> be in any sort of hot path, so I figured MOD was fine.
> >>
> >> The net_random is a very cheap method (last I checked), as well.
> >>
> >> So, I think that part is OK as it is, but I'm open to
> >> persuasion :)
> >
> > Perhaps I'm confused, or simply channeling Emily Litella again, but if
> > you only do this on the 1st through Nth retransmissions (ie after the
> > first retransmission timer has popped) don't you still have a thundering
> > herd problem on the first transmission and the first retransmission of
> > ARP requests?
>
> You'd certainly have it on the first transmission, but I think from there on
> the randomness should kick in.  This is a pretty rare case, and I'd rather
> not slow down the initial ARP.  If we *are* in the overload situation, then
> the network can just purge/drop/whatever the initial flood and then the
> retransmits should start doing their random thing.  On my system, it still
> takes maybe 30 seconds for all the ARPs to resolve since a good deal of
> the requests and/or responses are being lost.
>
> After some more testing, I can still get it into a bad
> state if I have a retrans timer of 1 sec and a randomness of 5 secs
> and manage to cause all 1000 arp entries to go stale at once (by
> yanking a cable, for instance).
>
> It seems I have to bump up the base timer to 3-5 seconds (I'm
> leaving the random backoff at 5 secs as well).

This scheme still seems hackish to me, so I'm going to defer on this
for now.
David Miller wrote:
>> After some more testing, I can still get it into a bad
>> state if I have a retrans timer of 1 sec and a randomness of 5 secs
>> and manage to cause all 1000 arp entries to go stale at once (by
>> yanking a cable, for instance).
>>
>> It seems I have to bump up the base timer to 3-5 seconds (I'm
>> leaving the random backoff at 5 secs as well).
>
> This scheme still seems hackish to me, so I'm going to defer on this
> for now.

You think something like an exponential backoff capped at some
user-configurable max-value would be better?
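As a rough illustration of what "an exponential backoff capped at some user-configurable max-value" could mean, here is a minimal userspace sketch. The function name `backoff_delay_ms` and its parameters are hypothetical; nothing like this was posted in the thread, and a kernel version would work in jiffies against `neigh_parms` rather than in plain milliseconds.

```c
#include <assert.h>

/*
 * Hypothetical sketch: delay before retransmission number `attempt`
 * (0 = first retransmit).  The delay starts at base_ms, doubles on each
 * further attempt, and never exceeds max_ms.
 */
static unsigned long backoff_delay_ms(unsigned long base_ms,
				      unsigned long max_ms,
				      unsigned int attempt)
{
	unsigned long delay = base_ms;

	assert(base_ms > 0 && max_ms >= base_ms);

	while (attempt-- > 0) {
		/*
		 * Stop doubling once another step would pass the cap.
		 * Comparing against max_ms / 2 also avoids overflow.
		 */
		if (delay > max_ms / 2)
			return max_ms;
		delay *= 2;
	}
	return delay;
}
```

With a 1 sec base and a 5 sec cap, the sequence of delays would be 1000, 2000, 4000, 5000, 5000, ... ms.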
On Thu, Nov 20, 2008 at 09:23:53AM -0800, Ben Greear wrote:
> You think something like an exponential backoff capped at some
> user-configurable max-value would be better?

I'll throw in an observation on arp behaviour on wifi / HomePNA: neither
protocol provides reliable delivery of broadcast traffic, while point to
point traffic is reliably delivered.  If arp traffic is not sufficiently
aggressive when a connection is first used, the user can end up waiting
some time until one of the broadcast packets finally gets through.  Doing
an exponential backoff will make this significantly worse, unless the
initial timeout is sufficiently small.

		-ben
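To put rough numbers on that concern, the following userspace sketch compares when the Nth probe goes out under a fixed retransmit interval and under doubling. It is a simplified model with hypothetical names (`probe_send_time_ms`), it ignores any cap or random component, and it assumes n is small enough that the sum does not overflow.

```c
#include <assert.h>

/*
 * Hypothetical model: time (ms) at which probe number n (0-based) is sent
 * when the first probe goes out at t = 0 and every earlier probe was lost.
 * With exponential == 0 the wait between probes is constant; otherwise it
 * doubles after each lost probe.
 */
static unsigned long probe_send_time_ms(unsigned long first_wait_ms,
					unsigned int n, int exponential)
{
	unsigned long t = 0;
	unsigned long wait = first_wait_ms;
	unsigned int i;

	assert(first_wait_ms > 0);

	for (i = 0; i < n; i++) {
		t += wait;		/* wait out the current interval */
		if (exponential)
			wait *= 2;	/* next interval is twice as long */
	}
	return t;
}
```

If the first four broadcasts are lost, the fifth probe goes out at 4000 ms with a fixed 1 sec interval, but at 15000 ms with doubling from 1 sec. Starting the doubling from 100 ms brings that back down to 1500 ms, which matches the point about needing a sufficiently small initial timeout.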
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 518ebe6..4c805b3 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2028,6 +2028,16 @@ Expression of retrans_time, which is deprecated, is in 1/100 seconds (for
 IPv4) or in jiffies (for IPv6).
 Expression of retrans_time_ms is in milliseconds.
 
+
+retrans_rand_backof_ms
+----------------------
+
+This is an extra delay (ms) for the retransmit timer.  A random value between
+0 and retrans_rand_backof_ms will be added to the retrans_timer.  Default
+is zero.  Setting this to a larger value will help large broadcast domains
+resolve ARP (for instance, 500 mac-vlans talking to 500 other mac-vlans).
+
+
 unres_qlen
 ----------
 
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 8dbe468..a45b5df 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -608,6 +608,7 @@ enum {
 	NET_NEIGH_GC_THRESH3=16,
 	NET_NEIGH_RETRANS_TIME_MS=17,
 	NET_NEIGH_REACHABLE_TIME_MS=18,
+	NET_NEIGH_RETRANS_RAND_BACKOFF=19,
 	__NET_NEIGH_MAX
 };
 
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 64a5f01..4947976 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -65,6 +65,7 @@ struct neigh_parms
 	int	proxy_delay;
 	int	proxy_qlen;
 	int	locktime;
+	int	retrans_rand_backoff;
 };
 
 struct neigh_statistics
diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
index 37c8fab..6f3467c 100644
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -249,6 +249,7 @@ static const struct trans_ctl_table trans_net_neigh_vars_table[] = {
 	{ NET_NEIGH_GC_THRESH3,		"gc_thresh3" },
 	{ NET_NEIGH_RETRANS_TIME_MS,	"retrans_time_ms" },
 	{ NET_NEIGH_REACHABLE_TIME_MS,	"base_reachable_time_ms" },
+	{ NET_NEIGH_RETRANS_RAND_BACKOFF, "retrans_rand_backoff_ms"},
 	{}
 };
 
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 19b8e00..ec1f048 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -765,6 +765,13 @@ static __inline__ int neigh_max_probes(struct neighbour *n)
 		p->ucast_probes + p->app_probes + p->mcast_probes);
 }
 
+static unsigned long neigh_rand_retry(struct neighbour* neigh) {
+	if (neigh->parms->retrans_rand_backoff) {
+		return net_random() % neigh->parms->retrans_rand_backoff;
+	}
+	return 0;
+}
+
 /* Called when a timer expires for a neighbour entry. */
 
 static void neigh_timer_handler(unsigned long arg)
@@ -820,11 +827,11 @@ static void neigh_timer_handler(unsigned long arg)
 			neigh->nud_state = NUD_PROBE;
 			neigh->updated = jiffies;
 			atomic_set(&neigh->probes, 0);
-			next = now + neigh->parms->retrans_time;
+			next = now + neigh->parms->retrans_time + neigh_rand_retry(neigh);
 		}
 	} else {
 		/* NUD_PROBE|NUD_INCOMPLETE */
-		next = now + neigh->parms->retrans_time;
+		next = now + neigh->parms->retrans_time + neigh_rand_retry(neigh);
 	}
 
 	if ((neigh->nud_state & (NUD_INCOMPLETE | NUD_PROBE)) &&
@@ -2642,6 +2649,14 @@ static struct neigh_sysctl_table {
 			.strategy	= &sysctl_ms_jiffies,
 		},
 		{
+			.ctl_name	= NET_NEIGH_RETRANS_RAND_BACKOFF,
+			.procname	= "retrans_rand_backoff_ms",
+			.maxlen		= sizeof(int),
+			.mode		= 0644,
+			.proc_handler	= &proc_dointvec_ms_jiffies,
+			.strategy	= &sysctl_ms_jiffies,
+		},
+		{
 			.ctl_name	= NET_NEIGH_GC_INTERVAL,
 			.procname	= "gc_interval",
 			.maxlen		= sizeof(int),
@@ -2712,18 +2727,19 @@ int neigh_sysctl_register(struct net_device *dev, struct neigh_parms *p,
 	t->neigh_vars[11].data  = &p->locktime;
 	t->neigh_vars[12].data  = &p->retrans_time;
 	t->neigh_vars[13].data  = &p->base_reachable_time;
+	t->neigh_vars[14].data  = &p->retrans_rand_backoff;
 
 	if (dev) {
 		dev_name_source = dev->name;
 		neigh_path[NEIGH_CTL_PATH_DEV].ctl_name = dev->ifindex;
 		/* Terminate the table early */
-		memset(&t->neigh_vars[14], 0, sizeof(t->neigh_vars[14]));
+		memset(&t->neigh_vars[15], 0, sizeof(t->neigh_vars[14]));
 	} else {
 		dev_name_source = neigh_path[NEIGH_CTL_PATH_DEV].procname;
-		t->neigh_vars[14].data = (int *)(p + 1);
-		t->neigh_vars[15].data = (int *)(p + 1) + 1;
-		t->neigh_vars[16].data = (int *)(p + 1) + 2;
-		t->neigh_vars[17].data = (int *)(p + 1) + 3;
+		t->neigh_vars[15].data = (int *)(p + 1);
+		t->neigh_vars[16].data = (int *)(p + 1) + 1;
+		t->neigh_vars[17].data = (int *)(p + 1) + 2;
+		t->neigh_vars[18].data = (int *)(p + 1) + 3;
 	}
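For readers who want to experiment with the timer arithmetic outside the kernel, the core of the change can be modelled in userspace roughly as below. This is only a sketch: `rand()` stands in for the kernel's `net_random()`, plain integers stand in for jiffies, and the names `rand_retry` and `next_probe` are made up for the example.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Userspace model of the patch's neigh_rand_retry().  rand() replaces
 * net_random() here.  Returns a value in [0, backoff), or 0 when the
 * backoff is disabled (backoff == 0).
 */
static unsigned long rand_retry(unsigned long backoff)
{
	if (backoff)
		return (unsigned long)rand() % backoff;
	return 0;
}

/* Next timer expiry: base retransmit interval plus the random extra delay. */
static unsigned long next_probe(unsigned long now, unsigned long retrans_time,
				unsigned long backoff)
{
	unsigned long next = now + retrans_time + rand_retry(backoff);

	/* Sanity check: the extra delay never shortens the base interval. */
	assert(next >= now + retrans_time);
	return next;
}
```

Because of the modulo, the extra delay is always strictly less than the configured backoff, so with a base of 100 and a backoff of 500 the expiry falls between 100 and 599 ticks after `now`.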