Message ID | 1588021007-16914-3-git-send-email-roopa@cumulusnetworks.com |
---|---|
State | Accepted |
Delegated to: | David Miller |
Series | New sysctl to turn off nexthop API compat mode |
On 4/27/20 2:56 PM, Roopa Prabhu wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
> Current route nexthop API maintains user space compatibility
> with old route API by default. Dumps and netlink notifications
> support both new and old API format. In systems which have
> moved to the new API, this compatibility mode cancels some
> of the performance benefits provided by the new nexthop API.
>
> This patch adds new sysctl nexthop_compat_mode which is on
> by default but provides the ability to turn off compatibility
> mode allowing systems to run entirely with the new routing
> API. Old route API behaviour and support is not modified by this
> sysctl.
>
> Uses a single sysctl to cover both ipv4 and ipv6 following
> other sysctls. Covers dumps and delete notifications as
> suggested by David Ahern.
>
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> ---
>  Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
>  include/net/netns/ipv4.h               |  2 ++
>  net/ipv4/af_inet.c                     |  1 +
>  net/ipv4/fib_semantics.c               |  3 +++
>  net/ipv4/nexthop.c                     |  5 +++--
>  net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
>  net/ipv6/route.c                       |  3 ++-
>  7 files changed, 32 insertions(+), 3 deletions(-)
>

Reviewed-by: David Ahern <dsahern@gmail.com>

On 27/04/2020 at 22:56, Roopa Prabhu wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
> Current route nexthop API maintains user space compatibility
> with old route API by default. Dumps and netlink notifications
> support both new and old API format. In systems which have
> moved to the new API, this compatibility mode cancels some
> of the performance benefits provided by the new nexthop API.
>
> This patch adds new sysctl nexthop_compat_mode which is on
> by default but provides the ability to turn off compatibility
> mode allowing systems to run entirely with the new routing
> API. Old route API behaviour and support is not modified by this
> sysctl.
>
> Uses a single sysctl to cover both ipv4 and ipv6 following
> other sysctls. Covers dumps and delete notifications as
> suggested by David Ahern.
>
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> ---
>  Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
>  include/net/netns/ipv4.h               |  2 ++
>  net/ipv4/af_inet.c                     |  1 +
>  net/ipv4/fib_semantics.c               |  3 +++
>  net/ipv4/nexthop.c                     |  5 +++--
>  net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
>  net/ipv6/route.c                       |  3 ++-
>  7 files changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 6fcfd31..a8f2da4 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -1553,6 +1553,18 @@ skip_notify_on_dev_down - BOOLEAN
>  	on userspace caches to track link events and evict routes.
>  	Default: false (generate message)
>
> +nexthop_compat_mode - BOOLEAN
> +	New nexthop API provides a means for managing nexthops independent of
> +	prefixes. Backwards compatibilty with old route format is enabled by
> +	default which means route dumps and notifications contain the new
> +	nexthop attribute but also the full, expanded nexthop definition.
> +	Further, updates or deletes of a nexthop configuration generate route
> +	notifications for each fib entry using the nexthop. Once a system
> +	understands the new API, this sysctl can be disabled to achieve full
> +	performance benefits of the new API by disabling the nexthop expansion
> +	and extraneous notifications.
> +	Default: true (backward compat mode)

Maybe it would be good to allow only the true -> false transition, to avoid
debugging nightmares. Once the user chooses to leave compat mode, it should
never be possible to come back to it; it's not a game ;-)

Regards,
Nicolas

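The one-way transition suggested in this reply can be pictured with a small user-space C model. It is only a sketch, not kernel code and not part of the patch: `compat_mode_write()` and its error codes are hypothetical, and an in-kernel version would presumably need a custom `proc_handler` rather than the `proc_dointvec_minmax` handler the patch uses.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical user-space model of a one-way boolean sysctl.
 * The value may stay where it is or move from 1 (compat on) to 0
 * (compat off), but a write that would turn compat back on is refused.
 *
 * Returns 0 on success, -EINVAL for values outside 0..1, and -EPERM
 * for an attempt to go from 0 back to 1.
 */
static int compat_mode_write(int *cur, int val)
{
	assert(cur != NULL);

	if (val < 0 || val > 1)
		return -EINVAL;	/* same range check as SYSCTL_ZERO..SYSCTL_ONE */
	if (val > *cur)
		return -EPERM;	/* 0 -> 1 is the forbidden transition */
	*cur = val;
	return 0;
}
```

Starting from the default of 1, writing 0 succeeds and every later write of 1 is rejected, which is the behaviour the reply asks for. Whether `-EPERM` is the right error to report is a design choice left open here.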
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 6fcfd31..a8f2da4 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1553,6 +1553,18 @@ skip_notify_on_dev_down - BOOLEAN
 	on userspace caches to track link events and evict routes.
 	Default: false (generate message)
 
+nexthop_compat_mode - BOOLEAN
+	New nexthop API provides a means for managing nexthops independent of
+	prefixes. Backwards compatibilty with old route format is enabled by
+	default which means route dumps and notifications contain the new
+	nexthop attribute but also the full, expanded nexthop definition.
+	Further, updates or deletes of a nexthop configuration generate route
+	notifications for each fib entry using the nexthop. Once a system
+	understands the new API, this sysctl can be disabled to achieve full
+	performance benefits of the new API by disabling the nexthop expansion
+	and extraneous notifications.
+	Default: true (backward compat mode)
+
 IPv6 Fragmentation:
 
 ip6frag_high_thresh - INTEGER
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 154b8f0..5acdb4d 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -111,6 +111,8 @@ struct netns_ipv4 {
 	int sysctl_tcp_early_demux;
 	int sysctl_udp_early_demux;
 
+	int sysctl_nexthop_compat_mode;
+
 	int sysctl_fwmark_reflect;
 	int sysctl_tcp_fwmark_accept;
 #ifdef CONFIG_NET_L3_MASTER_DEV
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c618e24..6177c4b 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1835,6 +1835,7 @@ static __net_init int inet_init_net(struct net *net)
 	net->ipv4.sysctl_ip_early_demux = 1;
 	net->ipv4.sysctl_udp_early_demux = 1;
 	net->ipv4.sysctl_tcp_early_demux = 1;
+	net->ipv4.sysctl_nexthop_compat_mode = 1;
 #ifdef CONFIG_SYSCTL
 	net->ipv4.sysctl_ip_prot_sock = PROT_SOCK;
 #endif
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 6ed8c93..7546b88 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1780,6 +1780,8 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
 			goto nla_put_failure;
 		if (nexthop_is_blackhole(fi->nh))
 			rtm->rtm_type = RTN_BLACKHOLE;
+		if (!fi->fib_net->ipv4.sysctl_nexthop_compat_mode)
+			goto offload;
 	}
 
 	if (nhs == 1) {
@@ -1805,6 +1807,7 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
 			goto nla_put_failure;
 	}
 
+offload:
 	if (fri->offload)
 		rtm->rtm_flags |= RTM_F_OFFLOAD;
 	if (fri->trap)
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 9999687..3957364 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -784,7 +784,8 @@ static void __remove_nexthop_fib(struct net *net, struct nexthop *nh)
 	list_for_each_entry_safe(f6i, tmp, &nh->f6i_list, nh_list) {
 		/* __ip6_del_rt does a release, so do a hold here */
 		fib6_info_hold(f6i);
-		ipv6_stub->ip6_del_rt(net, f6i, false);
+		ipv6_stub->ip6_del_rt(net, f6i,
+				      !net->ipv4.sysctl_nexthop_compat_mode);
 	}
 }
 
@@ -1041,7 +1042,7 @@ static int insert_nexthop(struct net *net, struct nexthop *new_nh,
 	if (!rc) {
 		nh_base_seq_inc(net);
 		nexthop_notify(RTM_NEWNEXTHOP, new_nh, &cfg->nlinfo);
-		if (replace_notify)
+		if (replace_notify && net->ipv4.sysctl_nexthop_compat_mode)
 			nexthop_replace_notify(net, new_nh, &cfg->nlinfo);
 	}
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 81b267e..95ad71e 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -711,6 +711,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.proc_handler   = proc_tcp_early_demux
 	},
 	{
+		.procname	= "nexthop_compat_mode",
+		.data		= &init_net.ipv4.sysctl_nexthop_compat_mode,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+	{
 		.procname	= "ip_default_ttl",
 		.data		= &init_net.ipv4.sysctl_ip_default_ttl,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 486c36a..803212a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5557,7 +5557,8 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
 		if (nexthop_is_blackhole(rt->nh))
 			rtm->rtm_type = RTN_BLACKHOLE;
 
-		if (rt6_fill_node_nexthop(skb, rt->nh, &nh_flags) < 0)
+		if (net->ipv4.sysctl_nexthop_compat_mode &&
+		    rt6_fill_node_nexthop(skb, rt->nh, &nh_flags) < 0)
 			goto nla_put_failure;
 
 		rtm->rtm_flags |= nh_flags;
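
Because the new entry is added to ipv4_net_table, it should appear to user space as /proc/sys/net/ipv4/nexthop_compat_mode. A routing daemon could check it before deciding how to parse route dumps. The following C sketch illustrates that idea; the helper names `read_bool_sysctl()` and `expect_expanded_nexthops()` are hypothetical and not part of this patch, and treating a missing file as "compat mode on" is an assumption based on older kernels always expanding nexthops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NEXTHOP_COMPAT_PATH "/proc/sys/net/ipv4/nexthop_compat_mode"

/*
 * Read a 0/1 sysctl value from the given procfs path.
 * Returns 0 or 1 on success, or -1 if the file cannot be opened or
 * does not hold a value in that range (for example on a kernel
 * without this patch).
 */
static int read_bool_sysctl(const char *path)
{
	FILE *f;
	int val = -1;
	int n;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fscanf(f, "%d", &val);
	fclose(f);

	if (n != 1 || val < 0 || val > 1)
		return -1;
	return val;
}

/*
 * Hypothetical policy helper: route dumps are expected to carry the
 * expanded nexthop definition unless compat mode is known to be off.
 * An unknown state (-1) is treated like compat mode on.
 */
static int expect_expanded_nexthops(int compat_mode)
{
	return compat_mode != 0;
}
```

A caller would use `expect_expanded_nexthops(read_bool_sysctl(NEXTHOP_COMPAT_PATH))` and, when it returns 0, resolve the nexthop id in each route against its own nexthop table instead of looking for the expanded attributes.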