Message ID | 1502997440-32334-1-git-send-email-dsahern@gmail.com |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
Hello David, David Ahern <dsahern@gmail.com> writes: > @@ -2688,15 +2716,9 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev, > { > u32 tb_id; > struct net *net = dev_net(idev->dev); > - struct net_device *dev = net->loopback_dev; > + struct net_device *dev = idev->dev; > struct rt6_info *rt; > > - /* use L3 Master device as loopback for host routes if device > - * is enslaved and address is not link local or multicast > - */ > - if (!rt6_need_strict(addr)) > - dev = l3mdev_master_dev_rcu(idev->dev) ? : dev; > - > rt = ip6_dst_alloc(net, dev, DST_NOCOUNT); > if (!rt) > return ERR_PTR(-ENOMEM); I am afraid this change might break Java: <http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l574> I am all in for this change, but maybe it might be necessary to mask RTF_LOCAL routes with "lo" somehow. Bye, Hannes
On 8/18/17 5:15 PM, Hannes Frederic Sowa wrote: > Hello David, > > David Ahern <dsahern@gmail.com> writes: > >> @@ -2688,15 +2716,9 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev, >> { >> u32 tb_id; >> struct net *net = dev_net(idev->dev); >> - struct net_device *dev = net->loopback_dev; >> + struct net_device *dev = idev->dev; >> struct rt6_info *rt; >> >> - /* use L3 Master device as loopback for host routes if device >> - * is enslaved and address is not link local or multicast >> - */ >> - if (!rt6_need_strict(addr)) >> - dev = l3mdev_master_dev_rcu(idev->dev) ? : dev; >> - >> rt = ip6_dst_alloc(net, dev, DST_NOCOUNT); >> if (!rt) >> return ERR_PTR(-ENOMEM); > > I am afraid this change might break Java: > > <http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l574> > > I am all in for this change, but maybe it might be necessary to mask > RTF_LOCAL routes with "lo" somehow. That's asinine. The if_inet6 processing is just getting the 'lo' interface index. Why scan the file looking for that? The ipv6_route processing is assembling routes against the loopback device regardless of what the route is. Do you know why - what the route list is used for? If it matters, we could keep 'lo' as the device for RTF_LOCAL routes in the proc files to keep backwards compatibility.
On 8/18/17 6:05 PM, David Ahern wrote: > On 8/18/17 5:15 PM, Hannes Frederic Sowa wrote: >> Hello David, >> >> David Ahern <dsahern@gmail.com> writes: >> >>> @@ -2688,15 +2716,9 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev, >>> { >>> u32 tb_id; >>> struct net *net = dev_net(idev->dev); >>> - struct net_device *dev = net->loopback_dev; >>> + struct net_device *dev = idev->dev; >>> struct rt6_info *rt; >>> >>> - /* use L3 Master device as loopback for host routes if device >>> - * is enslaved and address is not link local or multicast >>> - */ >>> - if (!rt6_need_strict(addr)) >>> - dev = l3mdev_master_dev_rcu(idev->dev) ? : dev; >>> - >>> rt = ip6_dst_alloc(net, dev, DST_NOCOUNT); >>> if (!rt) >>> return ERR_PTR(-ENOMEM); >> >> I am afraid this change might break Java: >> >> <http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l574> >> >> I am all in for this change, but maybe it might be necessary to mask >> RTF_LOCAL routes with "lo" somehow. > > That's asinine. The if_inet6 processing is just getting the 'lo' > interface index. Why scan the file looking for that? The ipv6_route > processing is assembling routes against the loopback device regardless > of what the route is. Do you know why - what the route list is used for? If I read it correctly, seems to be a 2.4 workaround: - only user of the route list is needsLoopbackRoute() - only caller of needsLoopbackRoute is here: http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l828
David Ahern <dsahern@gmail.com> writes: > On 8/18/17 6:05 PM, David Ahern wrote: >> On 8/18/17 5:15 PM, Hannes Frederic Sowa wrote: >>> Hello David, >>> >>> David Ahern <dsahern@gmail.com> writes: >>> >>>> @@ -2688,15 +2716,9 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev, >>>> { >>>> u32 tb_id; >>>> struct net *net = dev_net(idev->dev); >>>> - struct net_device *dev = net->loopback_dev; >>>> + struct net_device *dev = idev->dev; >>>> struct rt6_info *rt; >>>> >>>> - /* use L3 Master device as loopback for host routes if device >>>> - * is enslaved and address is not link local or multicast >>>> - */ >>>> - if (!rt6_need_strict(addr)) >>>> - dev = l3mdev_master_dev_rcu(idev->dev) ? : dev; >>>> - >>>> rt = ip6_dst_alloc(net, dev, DST_NOCOUNT); >>>> if (!rt) >>>> return ERR_PTR(-ENOMEM); >>> >>> I am afraid this change might break Java: >>> >>> <http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l574> >>> >>> I am all in for this change, but maybe it might be necessary to mask >>> RTF_LOCAL routes with "lo" somehow. >> >> That's asinine. The if_inet6 processing is just getting the 'lo' >> interface index. Why scan the file looking for that? The ipv6_route >> processing is assembling routes against the loopback device regardless >> of what the route is. Do you know why - what the route list is used for? > > > If I read it correctly, seems to be a 2.4 workaround: > - only user of the route list is needsLoopbackRoute() > - only caller of needsLoopbackRoute is here: > > http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l828 I agree that it looks like dead code now. But I know for sure that this code has been excercised at least at some point in time and caused problems for JVMs on Linux with IPv6. On the top of this file I found this comment: -- >8 -- /* following code creates a list of addresses from the kernel * routing table that are routed via the loopback address. * We check all destination addresses against this table * and override the scope_id field to use the relevant value for "lo" * in order to work-around the Linux bug that prevents packets destined * for certain local addresses from being sent via a physical interface. */ -- 8< -- I don't know if it makes sense to dive down into java history (and I also found e.g. net-snmp scanning /proc/net/ipv6_route). The same problem might be visible via RTM_GETROUTE dumps if applications implement their own source address selection maybe. :/ Bye, Hannes
From: David Ahern <dsahern@gmail.com> Date: Fri, 18 Aug 2017 18:05:56 -0600 > On 8/18/17 5:15 PM, Hannes Frederic Sowa wrote: >> Hello David, >> >> David Ahern <dsahern@gmail.com> writes: >> >>> @@ -2688,15 +2716,9 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev, >>> { >>> u32 tb_id; >>> struct net *net = dev_net(idev->dev); >>> - struct net_device *dev = net->loopback_dev; >>> + struct net_device *dev = idev->dev; >>> struct rt6_info *rt; >>> >>> - /* use L3 Master device as loopback for host routes if device >>> - * is enslaved and address is not link local or multicast >>> - */ >>> - if (!rt6_need_strict(addr)) >>> - dev = l3mdev_master_dev_rcu(idev->dev) ? : dev; >>> - >>> rt = ip6_dst_alloc(net, dev, DST_NOCOUNT); >>> if (!rt) >>> return ERR_PTR(-ENOMEM); >> >> I am afraid this change might break Java: >> >> <http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l574> >> >> I am all in for this change, but maybe it might be necessary to mask >> RTF_LOCAL routes with "lo" somehow. > > That's asinine. To say the least. We really should suggest to them a better way to do what they are trying to do.
On 8/18/17 6:28 PM, Hannes Frederic Sowa wrote: > David Ahern <dsahern@gmail.com> writes: > >> On 8/18/17 6:05 PM, David Ahern wrote: >>> On 8/18/17 5:15 PM, Hannes Frederic Sowa wrote: >>>> Hello David, >>>> >>>> David Ahern <dsahern@gmail.com> writes: >>>> >>>>> @@ -2688,15 +2716,9 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev, >>>>> { >>>>> u32 tb_id; >>>>> struct net *net = dev_net(idev->dev); >>>>> - struct net_device *dev = net->loopback_dev; >>>>> + struct net_device *dev = idev->dev; >>>>> struct rt6_info *rt; >>>>> >>>>> - /* use L3 Master device as loopback for host routes if device >>>>> - * is enslaved and address is not link local or multicast >>>>> - */ >>>>> - if (!rt6_need_strict(addr)) >>>>> - dev = l3mdev_master_dev_rcu(idev->dev) ? : dev; >>>>> - >>>>> rt = ip6_dst_alloc(net, dev, DST_NOCOUNT); >>>>> if (!rt) >>>>> return ERR_PTR(-ENOMEM); >>>> >>>> I am afraid this change might break Java: >>>> >>>> <http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l574> >>>> >>>> I am all in for this change, but maybe it might be necessary to mask >>>> RTF_LOCAL routes with "lo" somehow. >>> >>> That's asinine. The if_inet6 processing is just getting the 'lo' >>> interface index. Why scan the file looking for that? The ipv6_route >>> processing is assembling routes against the loopback device regardless >>> of what the route is. Do you know why - what the route list is used for? >> >> >> If I read it correctly, seems to be a 2.4 workaround: >> - only user of the route list is needsLoopbackRoute() >> - only caller of needsLoopbackRoute is here: >> >> http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l828 > > I agree that it looks like dead code now. But I know for sure that this > code has been excercised at least at some point in time and caused > problems for JVMs on Linux with IPv6. > > On the top of this file I found this comment: > > -- >8 -- > /* following code creates a list of addresses from the kernel > * routing table that are routed via the loopback address. > * We check all destination addresses against this table > * and override the scope_id field to use the relevant value for "lo" > * in order to work-around the Linux bug that prevents packets destined > * for certain local addresses from being sent via a physical interface. > */ > -- 8< -- > > I don't know if it makes sense to dive down into java history (and I > also found e.g. net-snmp scanning /proc/net/ipv6_route). The same > problem might be visible via RTM_GETROUTE dumps if applications > implement their own source address selection maybe. :/ Dave: The java code appears to be dead code from 2.4 time frame. no longer relevant for this patch. Hannes: I believe the net-snmp use case is populating the MIB. MIB entries, like the proc files and rtnetlink responses, will have the change in that host routes show the device with the address rather than loopback. But, looking at the send code I don't see this change having an impact.
From: David Ahern <dsahern@gmail.com> Date: Thu, 17 Aug 2017 12:17:20 -0700 > One nagging difference between ipv4 and ipv6 is host routes for ipv6 > addresses are installed using the loopback device or VRF / L3 Master > device. e.g., > > 2001:db8:1::/120 dev veth0 proto kernel metric 256 pref medium > local 2001:db8:1::1 dev lo table local proto kernel metric 0 pref medium > > Using the loopback device is convenient -- necessary for local tx, but > has some nasty side effects, most notably setting the 'lo' device down > causes all host routes for all local IPv6 address to be removed from the > FIB and completely breaks IPv6 networking across all interfaces. > > This patch puts FIB entries for IPv6 routes against the device. This > simplifies the routes in the FIB, for example by making dst->dev and > rt6i_idev->dev the same (a future patch can look at removing the device > reference taken for rt6i_idev for FIB entries). > > When copies are made on FIB lookups, the cloned route has dst->dev > set to loopback (or the L3 master device). This is needed for the > local Tx of packets to local addresses. > > With fib entries allocated against the real network device, the addrconf > code that reinserts host routes on admin up of 'lo' is no longer needed. > > Signed-off-by: David Ahern <dsahern@gmail.com> I've applied this. I'm unlikely to revert this just for java, given the kinds of things they are doing.
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 640792e1ecb7..45d0a24644de 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -3030,9 +3030,6 @@ static void sit_add_v4_addrs(struct inet6_dev *idev) static void init_loopback(struct net_device *dev) { struct inet6_dev *idev; - struct net_device *sp_dev; - struct inet6_ifaddr *sp_ifa; - struct rt6_info *sp_rt; /* ::1 */ @@ -3045,45 +3042,6 @@ static void init_loopback(struct net_device *dev) } add_addr(idev, &in6addr_loopback, 128, IFA_HOST); - - /* Add routes to other interface's IPv6 addresses */ - for_each_netdev(dev_net(dev), sp_dev) { - if (!strcmp(sp_dev->name, dev->name)) - continue; - - idev = __in6_dev_get(sp_dev); - if (!idev) - continue; - - read_lock_bh(&idev->lock); - list_for_each_entry(sp_ifa, &idev->addr_list, if_list) { - - if (sp_ifa->flags & (IFA_F_DADFAILED | IFA_F_TENTATIVE)) - continue; - - if (sp_ifa->rt) { - /* This dst has been added to garbage list when - * lo device down, release this obsolete dst and - * reallocate a new router for ifa. - */ - if (!sp_ifa->rt->rt6i_node) { - ip6_rt_put(sp_ifa->rt); - sp_ifa->rt = NULL; - } else { - continue; - } - } - - sp_rt = addrconf_dst_alloc(idev, &sp_ifa->addr, false); - - /* Failure cases are ignored */ - if (!IS_ERR(sp_rt)) { - sp_ifa->rt = sp_rt; - ip6_ins_rt(sp_rt); - } - } - read_unlock_bh(&idev->lock); - } } void addrconf_add_linklocal(struct inet6_dev *idev, diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c index 8d7b113958b1..4f82830fc068 100644 --- a/net/ipv6/icmp.c +++ b/net/ipv6/icmp.c @@ -459,9 +459,20 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, * Source addr check */ - if (__ipv6_addr_needs_scope_id(addr_type)) + if (__ipv6_addr_needs_scope_id(addr_type)) { iif = skb->dev->ifindex; - else { + + /* for local packets, get the real device index */ + if (iif == LOOPBACK_IFINDEX) { + dst = skb_dst(skb); + if (dst) { + struct rt6_info *rt; + + rt = container_of(dst, struct rt6_info, dst); + iif = rt->rt6i_idev->dev->ifindex; + } + } + } else { dst = skb_dst(skb); iif = l3mdev_master_ifindex(dst ? dst->dev : skb->dev); } diff --git a/net/ipv6/route.c b/net/ipv6/route.c index dc021ed6dd37..1864effcaf00 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -958,10 +958,34 @@ int ip6_ins_rt(struct rt6_info *rt) return __ip6_ins_rt(rt, &info, &mxc, NULL); } +/* called with rcu_lock held */ +static struct net_device *ip6_rt_get_dev_rcu(struct rt6_info *rt) +{ + struct net_device *dev = rt->dst.dev; + + if (rt->rt6i_flags & RTF_LOCAL) { + /* for copies of local routes, dst->dev needs to be the + * device if it is a master device, the master device if + * device is enslaved, and the loopback as the default + */ + if (netif_is_l3_slave(dev) && + !rt6_need_strict(&rt->rt6i_dst.addr)) + dev = l3mdev_master_dev_rcu(dev); + else if (!netif_is_l3_master(dev)) + dev = dev_net(dev)->loopback_dev; + /* last case is netif_is_l3_master(dev) is true in which + * case we want dev returned to be dev + */ + } + + return dev; +} + static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort, const struct in6_addr *daddr, const struct in6_addr *saddr) { + struct net_device *dev; struct rt6_info *rt; /* @@ -971,8 +995,10 @@ static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort, if (ort->rt6i_flags & (RTF_CACHE | RTF_PCPU)) ort = (struct rt6_info *)ort->dst.from; - rt = __ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev, 0); - + rcu_read_lock(); + dev = ip6_rt_get_dev_rcu(ort); + rt = __ip6_dst_alloc(dev_net(dev), dev, 0); + rcu_read_unlock(); if (!rt) return NULL; @@ -1000,11 +1026,13 @@ static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort, static struct rt6_info *ip6_rt_pcpu_alloc(struct rt6_info *rt) { + struct net_device *dev; struct rt6_info *pcpu_rt; - pcpu_rt = __ip6_dst_alloc(dev_net(rt->dst.dev), - rt->dst.dev, rt->dst.flags); - + rcu_read_lock(); + dev = ip6_rt_get_dev_rcu(rt); + pcpu_rt = __ip6_dst_alloc(dev_net(dev), dev, rt->dst.flags); + rcu_read_unlock(); if (!pcpu_rt) return NULL; ip6_rt_copy_init(pcpu_rt, rt); @@ -2688,15 +2716,9 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev, { u32 tb_id; struct net *net = dev_net(idev->dev); - struct net_device *dev = net->loopback_dev; + struct net_device *dev = idev->dev; struct rt6_info *rt; - /* use L3 Master device as loopback for host routes if device - * is enslaved and address is not link local or multicast - */ - if (!rt6_need_strict(addr)) - dev = l3mdev_master_dev_rcu(idev->dev) ? : dev; - rt = ip6_dst_alloc(net, dev, DST_NOCOUNT); if (!rt) return ERR_PTR(-ENOMEM);
One nagging difference between ipv4 and ipv6 is host routes for ipv6 addresses are installed using the loopback device or VRF / L3 Master device. e.g., 2001:db8:1::/120 dev veth0 proto kernel metric 256 pref medium local 2001:db8:1::1 dev lo table local proto kernel metric 0 pref medium Using the loopback device is convenient -- necessary for local tx, but has some nasty side effects, most notably setting the 'lo' device down causes all host routes for all local IPv6 address to be removed from the FIB and completely breaks IPv6 networking across all interfaces. This patch puts FIB entries for IPv6 routes against the device. This simplifies the routes in the FIB, for example by making dst->dev and rt6i_idev->dev the same (a future patch can look at removing the device reference taken for rt6i_idev for FIB entries). When copies are made on FIB lookups, the cloned route has dst->dev set to loopback (or the L3 master device). This is needed for the local Tx of packets to local addresses. With fib entries allocated against the real network device, the addrconf code that reinserts host routes on admin up of 'lo' is no longer needed. Signed-off-by: David Ahern <dsahern@gmail.com> --- v2 - update icmp6_send to use rt6i_idev for linklocal addresses if skb dev is loopback. ICMP unreachables were hitting the wrong route in response to UDP messages to local linklocal addresses with no server listening net/ipv6/addrconf.c | 42 ------------------------------------------ net/ipv6/icmp.c | 15 +++++++++++++-- net/ipv6/route.c | 46 ++++++++++++++++++++++++++++++++++------------ 3 files changed, 47 insertions(+), 56 deletions(-)