diff mbox

[net-next] net: change fib behavior based on interface link status

Message ID 1433300839-18511-1-git-send-email-gospo@cumulusnetworks.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Andy Gospodarek June 3, 2015, 3:07 a.m. UTC
This patch adds the ability to have the Linux kernel track whether or
not a particular route should be used based on the link-status of the
interface associated with the next-hop.

Before this patch any link-failure on an interface that was serving as a
gateway for some systems could result in those systems being isolated
from the rest of the network as the stack would continue to attempt to
send frames out of an interface that is actually linked-down.  When the
kernel is responsible for all forwarding, it should also be responsible
for taking action when the traffic can no longer be forwarded -- there
is no real need to outsource link-monitoring to userspace anymore.

This feature is only enabled with the new sysctl set (default is off):
net.core.kill_routes_on_linkdown = 1

When this is set, the following behavior can be observed (interface p8p1
is link-down):

# ip route show 
default via 10.0.5.2 dev p9p1 
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
# ip route get 90.0.0.1 
90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
    cache 
# ip route get 80.0.0.1 
local 80.0.0.1 dev lo  src 80.0.0.1 
    cache <local> 
# ip route get 80.0.0.2
80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
    cache 

While the route does remain in the table (so it can be modified if
needed rather than being wiped away as it would be if IFF_UP was
cleared), the proper next-hop is chosen automatically when the link is
down.  Now interface p8p1 is linked-up:

# ip route show 
default via 10.0.5.2 dev p9p1 
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
# ip route get 90.0.0.1 
90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
    cache 
# ip route get 80.0.0.1 
local 80.0.0.1 dev lo  src 80.0.0.1 
    cache <local> 
# ip route get 80.0.0.2
80.0.0.2 dev p8p1  src 80.0.0.1 
    cache 

and the output changes to what one would expect.

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>

---
Though there were some that preferred not to have a configuration option
and to make this behavior the default when it was discussed in Ottawa
earlier this year since "it was time to do this."  I wanted to propose
the config option to preserve the current behavior for those that desire
it.  I'll happily remove it if Dave and Linus approve.

An IPv6 implementation is also needed (DECnet too!), but I wanted to
start with the IPv4 implementation to get people comfortable with the
idea before moving forward.  If this is accepted the IPv6 implementation
can be posted shortly.  

FWIW, we have been running this patch with the sysctl setting above and
our customers have been happily using a backported version for IPv4 and
IPv6 for >6 months.

 include/linux/netdevice.h      |  1 +
 include/net/fib_rules.h        |  1 +
 include/net/ip_fib.h           |  1 +
 include/uapi/linux/rtnetlink.h |  1 +
 include/uapi/linux/sysctl.h    |  1 +
 kernel/sysctl_binary.c         |  1 +
 net/core/dev.c                 |  2 ++
 net/core/sysctl_net_core.c     |  7 +++++++
 net/ipv4/fib_frontend.c        | 12 +++++++++--
 net/ipv4/fib_rules.c           |  7 ++++++-
 net/ipv4/fib_semantics.c       | 46 ++++++++++++++++++++++++++++++++++++------
 net/ipv4/fib_trie.c            | 19 +++++++++++++----
 12 files changed, 86 insertions(+), 13 deletions(-)

Comments

Scott Feldman June 3, 2015, 5:03 a.m. UTC | #1
On Tue, Jun 2, 2015 at 8:07 PM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:
> This patch adds the ability to have the Linux kernel track whether or
> not a particular route should be used based on the link-status of the
> interface associated with the next-hop.
>
> Before this patch any link-failure on an interface that was serving as a
> gateway for some systems could result in those systems being isolated
> from the rest of the network as the stack would continue to attempt to
> send frames out of an interface that is actually linked-down.  When the
> kernel is responsible for all forwarding, it should also be responsible
> for taking action when the traffic can no longer be forwarded -- there
> is no real need to outsource link-monitoring to userspace anymore.

Hi Andy, how does this work for the hardware offload case?  I'm not
seeing how hardware gets updated when the software route is updated.
Seems hardware isn't updated and would send frames out the downed nh
interface.

-scott
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa June 3, 2015, 9:35 a.m. UTC | #2
On Wed, Jun 3, 2015, at 05:07, Andy Gospodarek wrote:
> This patch adds the ability to have the Linux kernel track whether or
> not a particular route should be used based on the link-status of the
> interface associated with the next-hop.
> 
> Before this patch any link-failure on an interface that was serving as a
> gateway for some systems could result in those systems being isolated
> from the rest of the network as the stack would continue to attempt to
> send frames out of an interface that is actually linked-down.  When the
> kernel is responsible for all forwarding, it should also be responsible
> for taking action when the traffic can no longer be forwarded -- there
> is no real need to outsource link-monitoring to userspace anymore.
> 
> This feature is only enabled with the new sysctl set (default is off):
> net.core.kill_routes_on_linkdown = 1
> 
> When this is set, the following behavior can be observed (interface p8p1
> is link-down):
> 
> # ip route show 
> default via 10.0.5.2 dev p9p1 
> 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
> 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
> 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> # ip route get 90.0.0.1 
> 90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
>     cache 
> # ip route get 80.0.0.1 
> local 80.0.0.1 dev lo  src 80.0.0.1 
>     cache <local> 
> # ip route get 80.0.0.2
> 80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
>     cache 
> 
> While the route does remain in the table (so it can be modified if
> needed rather than being wiped away as it would be if IFF_UP was
> cleared), the proper next-hop is chosen automatically when the link is
> down.  Now interface p8p1 is linked-up:
> 
> # ip route show 
> default via 10.0.5.2 dev p9p1 
> 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
> 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
> 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
> # ip route get 90.0.0.1 
> 90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
>     cache 
> # ip route get 80.0.0.1 
> local 80.0.0.1 dev lo  src 80.0.0.1 
>     cache <local> 
> # ip route get 80.0.0.2
> 80.0.0.2 dev p8p1  src 80.0.0.1 
>     cache 
> 
> and the output changes to what one would expect.
> 
> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> 
> ---
> Though there were some that preferred not to have a configuration option
> and to make this behavior the default when it was discussed in Ottawa
> earlier this year since "it was time to do this."  I wanted to propose
> the config option to preserve the current behavior for those that desire
> it.  I'll happily remove it if Dave and Linus approve.

I raised the concern that in case we don't have any other fallback route
and the kernel decides to send back ICMP errors to the end host, we
could kill TCP connections with those error messages. The current
behavior is that the packet gets silently dropped and TCP will retry, no
ICMP error message is send by immediate routers. This is especially
important if only a short link loss event happens on a default route.

Thus I prefer the configuration option with the default value being off.

This is a great feature, thanks!
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Gospodarek June 3, 2015, 1:49 p.m. UTC | #3
On Wed, Jun 03, 2015 at 11:35:09AM +0200, Hannes Frederic Sowa wrote:
> On Wed, Jun 3, 2015, at 05:07, Andy Gospodarek wrote:
> > This patch adds the ability to have the Linux kernel track whether or
> > not a particular route should be used based on the link-status of the
> > interface associated with the next-hop.
> > 
> > Before this patch any link-failure on an interface that was serving as a
> > gateway for some systems could result in those systems being isolated
> > from the rest of the network as the stack would continue to attempt to
> > send frames out of an interface that is actually linked-down.  When the
> > kernel is responsible for all forwarding, it should also be responsible
> > for taking action when the traffic can no longer be forwarded -- there
> > is no real need to outsource link-monitoring to userspace anymore.
> > 
> > This feature is only enabled with the new sysctl set (default is off):
> > net.core.kill_routes_on_linkdown = 1
> > 
> > When this is set, the following behavior can be observed (interface p8p1
> > is link-down):
> > 
> > # ip route show 
> > default via 10.0.5.2 dev p9p1 
> > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
> > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
> > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > # ip route get 90.0.0.1 
> > 90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
> >     cache 
> > # ip route get 80.0.0.1 
> > local 80.0.0.1 dev lo  src 80.0.0.1 
> >     cache <local> 
> > # ip route get 80.0.0.2
> > 80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
> >     cache 
> > 
> > While the route does remain in the table (so it can be modified if
> > needed rather than being wiped away as it would be if IFF_UP was
> > cleared), the proper next-hop is chosen automatically when the link is
> > down.  Now interface p8p1 is linked-up:
> > 
> > # ip route show 
> > default via 10.0.5.2 dev p9p1 
> > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
> > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
> > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
> > # ip route get 90.0.0.1 
> > 90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
> >     cache 
> > # ip route get 80.0.0.1 
> > local 80.0.0.1 dev lo  src 80.0.0.1 
> >     cache <local> 
> > # ip route get 80.0.0.2
> > 80.0.0.2 dev p8p1  src 80.0.0.1 
> >     cache 
> > 
> > and the output changes to what one would expect.
> > 
> > Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> > Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> > 
> > ---
> > Though there were some that preferred not to have a configuration option
> > and to make this behavior the default when it was discussed in Ottawa
> > earlier this year since "it was time to do this."  I wanted to propose
> > the config option to preserve the current behavior for those that desire
> > it.  I'll happily remove it if Dave and Linus approve.
> 
> I raised the concern that in case we don't have any other fallback route
> and the kernel decides to send back ICMP errors to the end host, we
> could kill TCP connections with those error messages. The current
> behavior is that the packet gets silently dropped and TCP will retry, no
> ICMP error message is send by immediate routers. This is especially
> important if only a short link loss event happens on a default route.
If you do not have any default route configured (or your default route
is the one that went down!), then you could see this happening.
> 
[...]
> 
> This is a great feature, thanks!
Glad you like it.

> Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman June 3, 2015, 1:53 p.m. UTC | #4
On Tue, Jun 02, 2015 at 11:07:19PM -0400, Andy Gospodarek wrote:
> This patch adds the ability to have the Linux kernel track whether or
> not a particular route should be used based on the link-status of the
> interface associated with the next-hop.
> 
> Before this patch any link-failure on an interface that was serving as a
> gateway for some systems could result in those systems being isolated
> from the rest of the network as the stack would continue to attempt to
> send frames out of an interface that is actually linked-down.  When the
> kernel is responsible for all forwarding, it should also be responsible
> for taking action when the traffic can no longer be forwarded -- there
> is no real need to outsource link-monitoring to userspace anymore.
> 
> This feature is only enabled with the new sysctl set (default is off):
> net.core.kill_routes_on_linkdown = 1
> 
> When this is set, the following behavior can be observed (interface p8p1
> is link-down):
> 
> # ip route show 
> default via 10.0.5.2 dev p9p1 
> 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
> 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
> 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> # ip route get 90.0.0.1 
> 90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
>     cache 
> # ip route get 80.0.0.1 
> local 80.0.0.1 dev lo  src 80.0.0.1 
>     cache <local> 
> # ip route get 80.0.0.2
> 80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
>     cache 
> 
> While the route does remain in the table (so it can be modified if
> needed rather than being wiped away as it would be if IFF_UP was
> cleared), the proper next-hop is chosen automatically when the link is
> down.  Now interface p8p1 is linked-up:
> 
> # ip route show 
> default via 10.0.5.2 dev p9p1 
> 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
> 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
> 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
> # ip route get 90.0.0.1 
> 90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
>     cache 
> # ip route get 80.0.0.1 
> local 80.0.0.1 dev lo  src 80.0.0.1 
>     cache <local> 
> # ip route get 80.0.0.2
> 80.0.0.2 dev p8p1  src 80.0.0.1 
>     cache 
> 
> and the output changes to what one would expect.
> 
> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> 
> ---
> Though there were some that preferred not to have a configuration option
> and to make this behavior the default when it was discussed in Ottawa
> earlier this year since "it was time to do this."  I wanted to propose
> the config option to preserve the current behavior for those that desire
> it.  I'll happily remove it if Dave and Linus approve.
> 
> An IPv6 implementation is also needed (DECnet too!), but I wanted to
> start with the IPv4 implementation to get people comfortable with the
> idea before moving forward.  If this is accepted the IPv6 implementation
> can be posted shortly.  
> 
> FWIW, we have been running this patch with the sysctl setting above and
> our customers have been happily using a backported version for IPv4 and
> IPv6 for >6 months.
> 
>  include/linux/netdevice.h      |  1 +
>  include/net/fib_rules.h        |  1 +
>  include/net/ip_fib.h           |  1 +
>  include/uapi/linux/rtnetlink.h |  1 +
>  include/uapi/linux/sysctl.h    |  1 +
>  kernel/sysctl_binary.c         |  1 +
>  net/core/dev.c                 |  2 ++
>  net/core/sysctl_net_core.c     |  7 +++++++
>  net/ipv4/fib_frontend.c        | 12 +++++++++--
>  net/ipv4/fib_rules.c           |  7 ++++++-
>  net/ipv4/fib_semantics.c       | 46 ++++++++++++++++++++++++++++++++++++------
>  net/ipv4/fib_trie.c            | 19 +++++++++++++----
>  12 files changed, 86 insertions(+), 13 deletions(-)
> 

Hey Andy-
	So, the implementation looks great.  That said, a question comes up in
my mind that I can't quite resolve.  In looking at your code, and seeing how it
fit into the routing path, I came accross the function sync_fib_down_dev, which
appears to be called in response to NETDEV_DOWN events.  Its purpose from my
read is to interrogate the fib for routes that specify the device going down as
the egress device, and mark said routes as dead.  I'm guessing that I'm missing
something here, but it seems as though this patch shouldn't be needed in light
of that behavior (unless the existing code is somehow broken, or I'm confused
about its purpose).  Could it be that whats really needed here is for Network
Manager (or other user space programs ) to stop removing routes in response to
NETDEV_DOWN events so that the above kernel code can do its job?

Regards
Neil

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa June 3, 2015, 2:03 p.m. UTC | #5
On Di, 2015-06-02 at 23:07 -0400, Andy Gospodarek wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 6f5f71f..5bd953c 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2986,6 +2986,7 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
>  bool is_skb_forwardable(struct net_device *dev, struct sk_buff *skb);
>  
>  extern int		netdev_budget;
> +extern int		kill_routes_on_linkdown;

Would it make sense to make it per netns?

>  
>  /* Called by rtnetlink.c:rtnl_unlock() */
>  void netdev_run_todo(void);
> diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
> index 6d67383..4fbfda5 100644
> --- a/include/net/fib_rules.h
> +++ b/include/net/fib_rules.h
> @@ -37,6 +37,7 @@ struct fib_lookup_arg {
>  	struct fib_rule		*rule;
>  	int			flags;
>  #define FIB_LOOKUP_NOREF	1
> +#define FIB_LOOKUP_ALLOWDEAD	2
>  };
>  
>  struct fib_rules_ops {
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index 54271ed..efb195b 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -250,6 +250,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id);
>  struct fib_table *fib_get_table(struct net *net, u32 id);
>  
>  int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res);
> +int fib_lookup_flags(struct net *n, struct flowi4 *flp, struct fib_result *res, int flags);
>  
>  static inline int fib_lookup(struct net *net, struct flowi4 *flp,
>  			     struct fib_result *res)
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 17fb02f..bc264d4 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -338,6 +338,7 @@ struct rtnexthop {
>  #define RTNH_F_PERVASIVE	2	/* Do recursive gateway lookup	*/
>  #define RTNH_F_ONLINK		4	/* Gateway is forced on link	*/
>  #define RTNH_F_OFFLOAD		8	/* offloaded route */
> +#define RTNH_F_DEAD_LINKDOWN	16	/* Route dead due to carrier down */
>  
>  /* Macros to handle hexthops */
>  

-- >8 --

> diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
> index 0956373..b6632dc 100644
> --- a/include/uapi/linux/sysctl.h
> +++ b/include/uapi/linux/sysctl.h
> @@ -276,6 +276,7 @@ enum
>  	NET_CORE_AEVENT_ETIME=20,
>  	NET_CORE_AEVENT_RSEQTH=21,
>  	NET_CORE_WARNINGS=22,
> +	NET_CORE_KILL_ROUTES_ON_LINKDOWN=23,
>  };
>  
>  /* /proc/sys/net/ethernet */
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 7e7746a..f319116 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -196,6 +196,7 @@ static const struct bin_table bin_net_core_table[] = {
>  	{ CTL_INT,	NET_CORE_AEVENT_ETIME,	"xfrm_aevent_etime" },
>  	{ CTL_INT,	NET_CORE_AEVENT_RSEQTH,	"xfrm_aevent_rseqth" },
>  	{ CTL_INT,	NET_CORE_WARNINGS,	"warnings" },
> +	{ CTL_INT,	NET_CORE_KILL_ROUTES_ON_LINKDOWN,	"kill_routes_on_linkdown" },
>  	{},
>  };

-- >8 --

I don't think we need this hunk any more as we don't update binary
sysctls anymore.

>  
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0602e91..50dc19c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3167,6 +3167,8 @@ EXPORT_SYMBOL(netdev_max_backlog);
>  int netdev_tstamp_prequeue __read_mostly = 1;
>  int netdev_budget __read_mostly = 300;
>  int weight_p __read_mostly = 64;            /* old backlog weight */
> +int kill_routes_on_linkdown = 0;
> +EXPORT_SYMBOL(kill_routes_on_linkdown);
>  
>  /* Called with irq disabled */
>  static inline void ____napi_schedule(struct softnet_data *sd,
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index 95b6139..fef1804 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -392,6 +392,13 @@ static struct ctl_table net_core_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec
>  	},
> +	{
> +		.procname	= "kill_routes_on_linkdown",
> +		.data		= &kill_routes_on_linkdown,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec
> +	},
>  	{ }
>  };
>  
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 872494e..94348c7 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -1107,6 +1107,7 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
>  	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>  	struct in_device *in_dev;
>  	struct net *net = dev_net(dev);
> +	unsigned flags;
>  
>  	if (event == NETDEV_UNREGISTER) {
>  		fib_disable_ip(dev, 2);
> @@ -1130,10 +1131,17 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
>  		rt_cache_flush(net);
>  		break;
>  	case NETDEV_DOWN:
> -		fib_disable_ip(dev, 0);
> +		fib_disable_ip(dev, 1);

Shouldn't this depend on the sysctl?

>  		break;
> -	case NETDEV_CHANGEMTU:
>  	case NETDEV_CHANGE:
> +		if (kill_routes_on_linkdown) {
> +			flags = dev_get_flags(dev);
> +			if (flags & (IFF_RUNNING|IFF_LOWER_UP))
> +				fib_sync_up(dev);
> +			else
> +				fib_sync_down_dev(dev, 0);
> +		}
> +	case NETDEV_CHANGEMTU:
>  		rt_cache_flush(net);
>  		break;
>  	}
> diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
> index 5615198..d135ec9 100644
> --- a/net/ipv4/fib_rules.c
> +++ b/net/ipv4/fib_rules.c
> @@ -49,9 +49,14 @@ struct fib4_rule {
>  
>  int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res)
>  {
> +	return fib_lookup_flags(net, flp, res, 0);
> +}
>
> +int fib_lookup_flags(struct net *net, struct flowi4 *flp, struct fib_result *res, int flags)
> +{
>  	struct fib_lookup_arg arg = {
>  		.result = res,
> -		.flags = FIB_LOOKUP_NOREF,
> +		.flags = FIB_LOOKUP_NOREF | flags,
>  	};
>  	int err;

Can fib_lookup take the flags argument always without adding a new
wrapper?
 
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index 28ec3c1..c0874ee 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -266,7 +266,7 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
>  #ifdef CONFIG_IP_ROUTE_CLASSID
>  		    nh->nh_tclassid != onh->nh_tclassid ||
>  #endif
> -		    ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD))
> +		    ((nh->nh_flags ^ onh->nh_flags) & ~(RTNH_F_DEAD|RTNH_F_DEAD_LINKDOWN)))
>  			return -1;
>  		onh++;
>  	} endfor_nexthops(fi);
> @@ -318,7 +318,7 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi)
>  		    nfi->fib_type == fi->fib_type &&
>  		    memcmp(nfi->fib_metrics, fi->fib_metrics,
>  			   sizeof(u32) * RTAX_MAX) == 0 &&
> -		    ((nfi->fib_flags ^ fi->fib_flags) & ~RTNH_F_DEAD) == 0 &&
> +		    ((nfi->fib_flags ^ fi->fib_flags) & ~(RTNH_F_DEAD|RTNH_F_DEAD_LINKDOWN)) == 0 &&
>  		    (nfi->fib_nhs == 0 || nh_comp(fi, nfi) == 0))
>  			return fi;
>  	}
> @@ -604,6 +604,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>  				return -ENODEV;
>  			if (!(dev->flags & IFF_UP))
>  				return -ENETDOWN;
> +			if (!netif_carrier_ok(dev) && kill_routes_on_linkdown)
> +				nh->nh_flags |= RTNH_F_DEAD;
>  			nh->nh_dev = dev;
>  			dev_hold(dev);
>  			nh->nh_scope = RT_SCOPE_LINK;
> @@ -621,7 +623,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>  			/* It is not necessary, but requires a bit of thinking */
>  			if (fl4.flowi4_scope < RT_SCOPE_LINK)
>  				fl4.flowi4_scope = RT_SCOPE_LINK;
> -			err = fib_lookup(net, &fl4, &res);
> +			err = fib_lookup_flags(net, &fl4, &res, FIB_LOOKUP_ALLOWDEAD);
>  			if (err) {
>  				rcu_read_unlock();
>  				return err;
> @@ -636,6 +638,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>  		if (!dev)
>  			goto out;
>  		dev_hold(dev);
> +		if (!netif_carrier_ok(dev) && kill_routes_on_linkdown)
> +			nh->nh_flags |= RTNH_F_DEAD;
>  		err = (dev->flags & IFF_UP) ? 0 : -ENETDOWN;
>  	} else {
>  		struct in_device *in_dev;
> @@ -654,6 +658,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>  		nh->nh_dev = in_dev->dev;
>  		dev_hold(nh->nh_dev);
>  		nh->nh_scope = RT_SCOPE_HOST;
> +		if (!netif_carrier_ok(nh->nh_dev) && kill_routes_on_linkdown)
> +			nh->nh_flags |= RTNH_F_DEAD;
>  		err = 0;
>  	}
>  out:
> @@ -760,6 +766,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>  	struct fib_info *ofi;
>  	int nhs = 1;
>  	struct net *net = cfg->fc_nlinfo.nl_net;
> +	int dead;
>  
>  	if (cfg->fc_type > RTN_MAX)
>  		goto err_inval;
> @@ -920,11 +927,18 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>  		if (!nh->nh_dev)
>  			goto failure;
>  	} else {
> +		dead = 0;
>  		change_nexthops(fi) {
>  			err = fib_check_nh(cfg, fi, nexthop_nh);
>  			if (err != 0)
>  				goto failure;
> +			if (nexthop_nh->nh_flags & RTNH_F_DEAD)
> +				dead++;
>  		} endfor_nexthops(fi)
> +		if ((dead == fi->fib_nhs)) {
> +			fi->fib_flags |= RTNH_F_DEAD;
> +			fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
> +		}
>  	}
>  
>  	if (fi->fib_prefsrc) {
> @@ -1097,6 +1111,8 @@ int fib_sync_down_addr(struct net *net, __be32 local)
>  			continue;
>  		if (fi->fib_prefsrc == local) {
>  			fi->fib_flags |= RTNH_F_DEAD;
> +			/* Addr is gone, more serious than a linkdown */
> +			fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
>  			ret++;
>  		}
>  	}
> @@ -1112,7 +1128,7 @@ int fib_sync_down_dev(struct net_device *dev, int force)
>  	struct hlist_head *head = &fib_info_devhash[hash];
>  	struct fib_nh *nh;
>  
> -	if (force)
> +	if (force > 1)
>  		scope = -1;

Why was this changed? I think force can be made a boolean.

>  
>  	hlist_for_each_entry(nh, head, nh_hash) {
> @@ -1147,6 +1163,11 @@ int fib_sync_down_dev(struct net_device *dev, int force)
>  		} endfor_nexthops(fi)
>  		if (dead == fi->fib_nhs) {
>  			fi->fib_flags |= RTNH_F_DEAD;
> +			/* force marks route down due to admin down and device removal. */
> +			if (!force)
> +				fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
> +			else
> +				fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
>  			ret++;
>  		}
>  	}
> @@ -1223,10 +1244,12 @@ int fib_sync_up(struct net_device *dev)
>  	struct hlist_head *head;
>  	struct fib_nh *nh;
>  	int ret;
> +	int link_up;
>  
>  	if (!(dev->flags & IFF_UP))
>  		return 0;
>  
> +	link_up = netif_carrier_ok(dev);
>  	prev_fi = NULL;
>  	hash = fib_devindex_hashfn(dev->ifindex);
>  	head = &fib_info_devhash[hash];
> @@ -1253,16 +1276,27 @@ int fib_sync_up(struct net_device *dev)
>  			if (nexthop_nh->nh_dev != dev ||
>  			    !__in_dev_get_rtnl(dev))
>  				continue;
> -			alive++;
> +			if (link_up) {
> +				/* Link is up, so mark NH as alive */
> +				nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
> +				alive++;
> +			} else
> +				nexthop_nh->nh_flags |= RTNH_F_DEAD;
> +
>  			spin_lock_bh(&fib_multipath_lock);
>  			nexthop_nh->nh_power = 0;
> -			nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
>  			spin_unlock_bh(&fib_multipath_lock);
>  		} endfor_nexthops(fi)
>  
>  		if (alive > 0) {
> +			/* Some NHs are alive, unmark the route as dead */
> +			fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
>  			fi->fib_flags &= ~RTNH_F_DEAD;
>  			ret++;
> +		} else {
> +			/* No NHs are alive, mark the route as dead */
> +			fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
> +			fi->fib_flags |= RTNH_F_DEAD;
>  		}
>  	}
>  
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 01bce15..eedf287 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -1401,12 +1401,18 @@ found:
>  #endif
>  			return err;
>  		}
> -		if (fi->fib_flags & RTNH_F_DEAD)
> -			continue;
> +		if (!(fib_flags & FIB_LOOKUP_ALLOWDEAD)) {
> +			/* if route is dead and link is down, keep looking  */
> +			if ((fi->fib_flags & RTNH_F_DEAD) &&
> +			    (fi->fib_flags & RTNH_F_DEAD_LINKDOWN))
> +				continue;
> +		}
>  		for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
>  			const struct fib_nh *nh = &fi->fib_nh[nhsel];
>  
> -			if (nh->nh_flags & RTNH_F_DEAD)
> +			/* allow next-hop to be added if link is down */
> +			if ((nh->nh_flags & RTNH_F_DEAD) &&
> +			    !(fi->fib_flags & RTNH_F_DEAD_LINKDOWN))
>  				continue;
>  			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
>  				continue;
> @@ -1829,7 +1835,12 @@ int fib_table_flush(struct fib_table *tb)
>  		hlist_for_each_entry_safe(fa, tmp, &n->leaf, fa_list) {
>  			struct fib_info *fi = fa->fa_info;
>  
> -			if (!fi || !(fi->fib_flags & RTNH_F_DEAD)) {
> +			/* DEAD and DEAD_LINKDOWN will not both be set
> +			 * with IFF_UP is cleared, so do not flush
> +			 * entries with only DEAD set
> +			 */
> +			if (!fi || !((fi->fib_flags & RTNH_F_DEAD) &&
> +			    !(fi->fib_flags & RTNH_F_DEAD_LINKDOWN))) {
>  				slen = fa->fa_slen;
>  				continue;
>  			}

Otherwise looks good, thanks!

Bye,
Hannes


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa June 3, 2015, 2:13 p.m. UTC | #6
On Mi, 2015-06-03 at 09:53 -0400, Neil Horman wrote:
> On Tue, Jun 02, 2015 at 11:07:19PM -0400, Andy Gospodarek wrote:
> > This patch adds the ability to have the Linux kernel track whether or
> > not a particular route should be used based on the link-status of the
> > interface associated with the next-hop.
> > 
> > Before this patch any link-failure on an interface that was serving as a
> > gateway for some systems could result in those systems being isolated
> > from the rest of the network as the stack would continue to attempt to
> > send frames out of an interface that is actually linked-down.  When the
> > kernel is responsible for all forwarding, it should also be responsible
> > for taking action when the traffic can no longer be forwarded -- there
> > is no real need to outsource link-monitoring to userspace anymore.
> > 
> > This feature is only enabled with the new sysctl set (default is off):
> > net.core.kill_routes_on_linkdown = 1
> > 
> > When this is set, the following behavior can be observed (interface p8p1
> > is link-down):
> > 
> > # ip route show 
> > default via 10.0.5.2 dev p9p1 
> > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
> > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
> > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > # ip route get 90.0.0.1 
> > 90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
> >     cache 
> > # ip route get 80.0.0.1 
> > local 80.0.0.1 dev lo  src 80.0.0.1 
> >     cache <local> 
> > # ip route get 80.0.0.2
> > 80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
> >     cache 
> > 
> > While the route does remain in the table (so it can be modified if
> > needed rather than being wiped away as it would be if IFF_UP was
> > cleared), the proper next-hop is chosen automatically when the link is
> > down.  Now interface p8p1 is linked-up:
> > 
> > # ip route show 
> > default via 10.0.5.2 dev p9p1 
> > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
> > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
> > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
> > # ip route get 90.0.0.1 
> > 90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
> >     cache 
> > # ip route get 80.0.0.1 
> > local 80.0.0.1 dev lo  src 80.0.0.1 
> >     cache <local> 
> > # ip route get 80.0.0.2
> > 80.0.0.2 dev p8p1  src 80.0.0.1 
> >     cache 
> > 
> > and the output changes to what one would expect.
> > 
> > Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> > Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> > 
> > ---
> > Though there were some that preferred not to have a configuration option
> > and to make this behavior the default when it was discussed in Ottawa
> > earlier this year since "it was time to do this."  I wanted to propose
> > the config option to preserve the current behavior for those that desire
> > it.  I'll happily remove it if Dave and Linus approve.
> > 
> > An IPv6 implementation is also needed (DECnet too!), but I wanted to
> > start with the IPv4 implementation to get people comfortable with the
> > idea before moving forward.  If this is accepted the IPv6 implementation
> > can be posted shortly.  
> > 
> > FWIW, we have been running this patch with the sysctl setting above and
> > our customers have been happily using a backported version for IPv4 and
> > IPv6 for >6 months.
> > 
> >  include/linux/netdevice.h      |  1 +
> >  include/net/fib_rules.h        |  1 +
> >  include/net/ip_fib.h           |  1 +
> >  include/uapi/linux/rtnetlink.h |  1 +
> >  include/uapi/linux/sysctl.h    |  1 +
> >  kernel/sysctl_binary.c         |  1 +
> >  net/core/dev.c                 |  2 ++
> >  net/core/sysctl_net_core.c     |  7 +++++++
> >  net/ipv4/fib_frontend.c        | 12 +++++++++--
> >  net/ipv4/fib_rules.c           |  7 ++++++-
> >  net/ipv4/fib_semantics.c       | 46 ++++++++++++++++++++++++++++++++++++------
> >  net/ipv4/fib_trie.c            | 19 +++++++++++++----
> >  12 files changed, 86 insertions(+), 13 deletions(-)
> > 
> 
> Hey Andy-
> 	So, the implementation looks great.  That said, a question comes up in
> my mind that I can't quite resolve.  In looking at your code, and seeing how it
> fit into the routing path, I came accross the function sync_fib_down_dev, which
> appears to be called in response to NETDEV_DOWN events.  Its purpose from my
> read is to interrogate the fib for routes that specify the device going down as
> the egress device, and mark said routes as dead.  I'm guessing that I'm missing
> something here, but it seems as though this patch shouldn't be needed in light
> of that behavior (unless the existing code is somehow broken, or I'm confused
> about its purpose).  Could it be that whats really needed here is for Network
> Manager (or other user space programs ) to stop removing routes in response to
> NETDEV_DOWN events so that the above kernel code can do its job?

We don't care about DEAD flag during normal fib lookup, only when
considering multipath routes.

Bye,
Hannes


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman June 3, 2015, 2:21 p.m. UTC | #7
On Wed, Jun 03, 2015 at 04:13:08PM +0200, Hannes Frederic Sowa wrote:
> On Mi, 2015-06-03 at 09:53 -0400, Neil Horman wrote:
> > On Tue, Jun 02, 2015 at 11:07:19PM -0400, Andy Gospodarek wrote:
> > > This patch adds the ability to have the Linux kernel track whether or
> > > not a particular route should be used based on the link-status of the
> > > interface associated with the next-hop.
> > > 
> > > Before this patch any link-failure on an interface that was serving as a
> > > gateway for some systems could result in those systems being isolated
> > > from the rest of the network as the stack would continue to attempt to
> > > send frames out of an interface that is actually linked-down.  When the
> > > kernel is responsible for all forwarding, it should also be responsible
> > > for taking action when the traffic can no longer be forwarded -- there
> > > is no real need to outsource link-monitoring to userspace anymore.
> > > 
> > > This feature is only enabled with the new sysctl set (default is off):
> > > net.core.kill_routes_on_linkdown = 1
> > > 
> > > When this is set, the following behavior can be observed (interface p8p1
> > > is link-down):
> > > 
> > > # ip route show 
> > > default via 10.0.5.2 dev p9p1 
> > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
> > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
> > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > # ip route get 90.0.0.1 
> > > 90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
> > >     cache 
> > > # ip route get 80.0.0.1 
> > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > >     cache <local> 
> > > # ip route get 80.0.0.2
> > > 80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
> > >     cache 
> > > 
> > > While the route does remain in the table (so it can be modified if
> > > needed rather than being wiped away as it would be if IFF_UP was
> > > cleared), the proper next-hop is chosen automatically when the link is
> > > down.  Now interface p8p1 is linked-up:
> > > 
> > > # ip route show 
> > > default via 10.0.5.2 dev p9p1 
> > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
> > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
> > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
> > > # ip route get 90.0.0.1 
> > > 90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
> > >     cache 
> > > # ip route get 80.0.0.1 
> > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > >     cache <local> 
> > > # ip route get 80.0.0.2
> > > 80.0.0.2 dev p8p1  src 80.0.0.1 
> > >     cache 
> > > 
> > > and the output changes to what one would expect.
> > > 
> > > Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> > > Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> > > 
> > > ---
> > > Though there were some that preferred not to have a configuration option
> > > and to make this behavior the default when it was discussed in Ottawa
> > > earlier this year since "it was time to do this."  I wanted to propose
> > > the config option to preserve the current behavior for those that desire
> > > it.  I'll happily remove it if Dave and Linus approve.
> > > 
> > > An IPv6 implementation is also needed (DECnet too!), but I wanted to
> > > start with the IPv4 implementation to get people comfortable with the
> > > idea before moving forward.  If this is accepted the IPv6 implementation
> > > can be posted shortly.  
> > > 
> > > FWIW, we have been running this patch with the sysctl setting above and
> > > our customers have been happily using a backported version for IPv4 and
> > > IPv6 for >6 months.
> > > 
> > >  include/linux/netdevice.h      |  1 +
> > >  include/net/fib_rules.h        |  1 +
> > >  include/net/ip_fib.h           |  1 +
> > >  include/uapi/linux/rtnetlink.h |  1 +
> > >  include/uapi/linux/sysctl.h    |  1 +
> > >  kernel/sysctl_binary.c         |  1 +
> > >  net/core/dev.c                 |  2 ++
> > >  net/core/sysctl_net_core.c     |  7 +++++++
> > >  net/ipv4/fib_frontend.c        | 12 +++++++++--
> > >  net/ipv4/fib_rules.c           |  7 ++++++-
> > >  net/ipv4/fib_semantics.c       | 46 ++++++++++++++++++++++++++++++++++++------
> > >  net/ipv4/fib_trie.c            | 19 +++++++++++++----
> > >  12 files changed, 86 insertions(+), 13 deletions(-)
> > > 
> > 
> > Hey Andy-
> > 	So, the implementation looks great.  That said, a question comes up in
> > my mind that I can't quite resolve.  In looking at your code, and seeing how it
> > fit into the routing path, I came accross the function sync_fib_down_dev, which
> > appears to be called in response to NETDEV_DOWN events.  Its purpose from my
> > read is to interrogate the fib for routes that specify the device going down as
> > the egress device, and mark said routes as dead.  I'm guessing that I'm missing
> > something here, but it seems as though this patch shouldn't be needed in light
> > of that behavior (unless the existing code is somehow broken, or I'm confused
> > about its purpose).  Could it be that whats really needed here is for Network
> > Manager (or other user space programs ) to stop removing routes in response to
> > NETDEV_DOWN events so that the above kernel code can do its job?
> 
> We don't care about DEAD flag during normal fib lookup, only when
> considering multipath routes.
> 

I get that, but I think the question still applies.  If you have multiple routes
to the same destination, the sync_fib_down_dev code should still be doing the
job that andy's patch does, shouldn't it?  I.e. if you have two interfaces with
ip addresses on the same subnet, and a route for each interface, only the first
route in the table typically gets used, because its found first.  But if that
link goes down, the sync_fib_down_dev code should mark the links route as dead,
and subsequent lookups should use the alternate interface.  But it seems like
thats not happening, so I'm wondering why that is.

Neil

> Bye,
> Hannes
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Gospodarek June 3, 2015, 2:24 p.m. UTC | #8
On Wed, Jun 03, 2015 at 04:03:48PM +0200, Hannes Frederic Sowa wrote:
> On Di, 2015-06-02 at 23:07 -0400, Andy Gospodarek wrote:
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 6f5f71f..5bd953c 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -2986,6 +2986,7 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
> >  bool is_skb_forwardable(struct net_device *dev, struct sk_buff *skb);
> >  
> >  extern int		netdev_budget;
> > +extern int		kill_routes_on_linkdown;
> 
> Would it make sense to make it per netns?
> 
I had seen it as a global behavior, but per netns could easily be an
option.

> >  
> >  /* Called by rtnetlink.c:rtnl_unlock() */
> >  void netdev_run_todo(void);
> > diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
> > index 6d67383..4fbfda5 100644
> > --- a/include/net/fib_rules.h
> > +++ b/include/net/fib_rules.h
> > @@ -37,6 +37,7 @@ struct fib_lookup_arg {
> >  	struct fib_rule		*rule;
> >  	int			flags;
> >  #define FIB_LOOKUP_NOREF	1
> > +#define FIB_LOOKUP_ALLOWDEAD	2
> >  };
> >  
> >  struct fib_rules_ops {
> > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> > index 54271ed..efb195b 100644
> > --- a/include/net/ip_fib.h
> > +++ b/include/net/ip_fib.h
> > @@ -250,6 +250,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id);
> >  struct fib_table *fib_get_table(struct net *net, u32 id);
> >  
> >  int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res);
> > +int fib_lookup_flags(struct net *n, struct flowi4 *flp, struct fib_result *res, int flags);
> >  
> >  static inline int fib_lookup(struct net *net, struct flowi4 *flp,
> >  			     struct fib_result *res)
> > diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> > index 17fb02f..bc264d4 100644
> > --- a/include/uapi/linux/rtnetlink.h
> > +++ b/include/uapi/linux/rtnetlink.h
> > @@ -338,6 +338,7 @@ struct rtnexthop {
> >  #define RTNH_F_PERVASIVE	2	/* Do recursive gateway lookup	*/
> >  #define RTNH_F_ONLINK		4	/* Gateway is forced on link	*/
> >  #define RTNH_F_OFFLOAD		8	/* offloaded route */
> > +#define RTNH_F_DEAD_LINKDOWN	16	/* Route dead due to carrier down */
> >  
> >  /* Macros to handle hexthops */
> >  
> 
> -- >8 --
> 
> > diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
> > index 0956373..b6632dc 100644
> > --- a/include/uapi/linux/sysctl.h
> > +++ b/include/uapi/linux/sysctl.h
> > @@ -276,6 +276,7 @@ enum
> >  	NET_CORE_AEVENT_ETIME=20,
> >  	NET_CORE_AEVENT_RSEQTH=21,
> >  	NET_CORE_WARNINGS=22,
> > +	NET_CORE_KILL_ROUTES_ON_LINKDOWN=23,
> >  };
> >  
> >  /* /proc/sys/net/ethernet */
> > diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> > index 7e7746a..f319116 100644
> > --- a/kernel/sysctl_binary.c
> > +++ b/kernel/sysctl_binary.c
> > @@ -196,6 +196,7 @@ static const struct bin_table bin_net_core_table[] = {
> >  	{ CTL_INT,	NET_CORE_AEVENT_ETIME,	"xfrm_aevent_etime" },
> >  	{ CTL_INT,	NET_CORE_AEVENT_RSEQTH,	"xfrm_aevent_rseqth" },
> >  	{ CTL_INT,	NET_CORE_WARNINGS,	"warnings" },
> > +	{ CTL_INT,	NET_CORE_KILL_ROUTES_ON_LINKDOWN,	"kill_routes_on_linkdown" },
> >  	{},
> >  };
> 
> -- >8 --
> 
> I don't think we need this hunk any more as we don't update binary
> sysctls anymore.
OK

> >  
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 0602e91..50dc19c 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3167,6 +3167,8 @@ EXPORT_SYMBOL(netdev_max_backlog);
> >  int netdev_tstamp_prequeue __read_mostly = 1;
> >  int netdev_budget __read_mostly = 300;
> >  int weight_p __read_mostly = 64;            /* old backlog weight */
> > +int kill_routes_on_linkdown = 0;
> > +EXPORT_SYMBOL(kill_routes_on_linkdown);
> >  
> >  /* Called with irq disabled */
> >  static inline void ____napi_schedule(struct softnet_data *sd,
> > diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> > index 95b6139..fef1804 100644
> > --- a/net/core/sysctl_net_core.c
> > +++ b/net/core/sysctl_net_core.c
> > @@ -392,6 +392,13 @@ static struct ctl_table net_core_table[] = {
> >  		.mode		= 0644,
> >  		.proc_handler	= proc_dointvec
> >  	},
> > +	{
> > +		.procname	= "kill_routes_on_linkdown",
> > +		.data		= &kill_routes_on_linkdown,
> > +		.maxlen		= sizeof(int),
> > +		.mode		= 0644,
> > +		.proc_handler	= proc_dointvec
> > +	},
> >  	{ }
> >  };
> >  
> > diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> > index 872494e..94348c7 100644
> > --- a/net/ipv4/fib_frontend.c
> > +++ b/net/ipv4/fib_frontend.c
> > @@ -1107,6 +1107,7 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
> >  	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> >  	struct in_device *in_dev;
> >  	struct net *net = dev_net(dev);
> > +	unsigned flags;
> >  
> >  	if (event == NETDEV_UNREGISTER) {
> >  		fib_disable_ip(dev, 2);
> > @@ -1130,10 +1131,17 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
> >  		rt_cache_flush(net);
> >  		break;
> >  	case NETDEV_DOWN:
> > -		fib_disable_ip(dev, 0);
> > +		fib_disable_ip(dev, 1);
> 
> Shouldn't this depend on the sysctl?
> 
See the comment 2 down from this for the reasoning.

> >  		break;
> > -	case NETDEV_CHANGEMTU:
> >  	case NETDEV_CHANGE:
> > +		if (kill_routes_on_linkdown) {
> > +			flags = dev_get_flags(dev);
> > +			if (flags & (IFF_RUNNING|IFF_LOWER_UP))
> > +				fib_sync_up(dev);
> > +			else
> > +				fib_sync_down_dev(dev, 0);
> > +		}
> > +	case NETDEV_CHANGEMTU:
> >  		rt_cache_flush(net);
> >  		break;
> >  	}
> > diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
> > index 5615198..d135ec9 100644
> > --- a/net/ipv4/fib_rules.c
> > +++ b/net/ipv4/fib_rules.c
> > @@ -49,9 +49,14 @@ struct fib4_rule {
> >  
> >  int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res)
> >  {
> > +	return fib_lookup_flags(net, flp, res, 0);
> > +}
> >
> > +int fib_lookup_flags(struct net *net, struct flowi4 *flp, struct fib_result *res, int flags)
> > +{
> >  	struct fib_lookup_arg arg = {
> >  		.result = res,
> > -		.flags = FIB_LOOKUP_NOREF,
> > +		.flags = FIB_LOOKUP_NOREF | flags,
> >  	};
> >  	int err;
> 
> Can fib_lookup take the flags argument always without adding a new
> wrapper?
It could, but __fib_lookup() is an exported symbol, so I reluctant to
change it.

>  
> > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> > index 28ec3c1..c0874ee 100644
> > --- a/net/ipv4/fib_semantics.c
> > +++ b/net/ipv4/fib_semantics.c
> > @@ -266,7 +266,7 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
> >  #ifdef CONFIG_IP_ROUTE_CLASSID
> >  		    nh->nh_tclassid != onh->nh_tclassid ||
> >  #endif
> > -		    ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD))
> > +		    ((nh->nh_flags ^ onh->nh_flags) & ~(RTNH_F_DEAD|RTNH_F_DEAD_LINKDOWN)))
> >  			return -1;
> >  		onh++;
> >  	} endfor_nexthops(fi);
> > @@ -318,7 +318,7 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi)
> >  		    nfi->fib_type == fi->fib_type &&
> >  		    memcmp(nfi->fib_metrics, fi->fib_metrics,
> >  			   sizeof(u32) * RTAX_MAX) == 0 &&
> > -		    ((nfi->fib_flags ^ fi->fib_flags) & ~RTNH_F_DEAD) == 0 &&
> > +		    ((nfi->fib_flags ^ fi->fib_flags) & ~(RTNH_F_DEAD|RTNH_F_DEAD_LINKDOWN)) == 0 &&
> >  		    (nfi->fib_nhs == 0 || nh_comp(fi, nfi) == 0))
> >  			return fi;
> >  	}
> > @@ -604,6 +604,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
> >  				return -ENODEV;
> >  			if (!(dev->flags & IFF_UP))
> >  				return -ENETDOWN;
> > +			if (!netif_carrier_ok(dev) && kill_routes_on_linkdown)
> > +				nh->nh_flags |= RTNH_F_DEAD;
> >  			nh->nh_dev = dev;
> >  			dev_hold(dev);
> >  			nh->nh_scope = RT_SCOPE_LINK;
> > @@ -621,7 +623,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
> >  			/* It is not necessary, but requires a bit of thinking */
> >  			if (fl4.flowi4_scope < RT_SCOPE_LINK)
> >  				fl4.flowi4_scope = RT_SCOPE_LINK;
> > -			err = fib_lookup(net, &fl4, &res);
> > +			err = fib_lookup_flags(net, &fl4, &res, FIB_LOOKUP_ALLOWDEAD);
> >  			if (err) {
> >  				rcu_read_unlock();
> >  				return err;
> > @@ -636,6 +638,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
> >  		if (!dev)
> >  			goto out;
> >  		dev_hold(dev);
> > +		if (!netif_carrier_ok(dev) && kill_routes_on_linkdown)
> > +			nh->nh_flags |= RTNH_F_DEAD;
> >  		err = (dev->flags & IFF_UP) ? 0 : -ENETDOWN;
> >  	} else {
> >  		struct in_device *in_dev;
> > @@ -654,6 +658,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
> >  		nh->nh_dev = in_dev->dev;
> >  		dev_hold(nh->nh_dev);
> >  		nh->nh_scope = RT_SCOPE_HOST;
> > +		if (!netif_carrier_ok(nh->nh_dev) && kill_routes_on_linkdown)
> > +			nh->nh_flags |= RTNH_F_DEAD;
> >  		err = 0;
> >  	}
> >  out:
> > @@ -760,6 +766,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
> >  	struct fib_info *ofi;
> >  	int nhs = 1;
> >  	struct net *net = cfg->fc_nlinfo.nl_net;
> > +	int dead;
> >  
> >  	if (cfg->fc_type > RTN_MAX)
> >  		goto err_inval;
> > @@ -920,11 +927,18 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
> >  		if (!nh->nh_dev)
> >  			goto failure;
> >  	} else {
> > +		dead = 0;
> >  		change_nexthops(fi) {
> >  			err = fib_check_nh(cfg, fi, nexthop_nh);
> >  			if (err != 0)
> >  				goto failure;
> > +			if (nexthop_nh->nh_flags & RTNH_F_DEAD)
> > +				dead++;
> >  		} endfor_nexthops(fi)
> > +		if ((dead == fi->fib_nhs)) {
> > +			fi->fib_flags |= RTNH_F_DEAD;
> > +			fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
> > +		}
> >  	}
> >  
> >  	if (fi->fib_prefsrc) {
> > @@ -1097,6 +1111,8 @@ int fib_sync_down_addr(struct net *net, __be32 local)
> >  			continue;
> >  		if (fi->fib_prefsrc == local) {
> >  			fi->fib_flags |= RTNH_F_DEAD;
> > +			/* Addr is gone, more serious than a linkdown */
> > +			fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
> >  			ret++;
> >  		}
> >  	}
> > @@ -1112,7 +1128,7 @@ int fib_sync_down_dev(struct net_device *dev, int force)
> >  	struct hlist_head *head = &fib_info_devhash[hash];
> >  	struct fib_nh *nh;
> >  
> > -	if (force)
> > +	if (force > 1)
> >  		scope = -1;
> 
> Why was this changed? I think force can be made a boolean.
I changed this and changed the call to fib_disable_ip() in
fib_netdev_event() above.  Since previous users of fib_disable_ip only
called force with '0' or '2' today, I added back a call to use 1 and
this change causes the behavior to be the same without the need to
check for the sysctl.  This may be a bit more confusing than necessary,
I would be willing to make it more explicit.

> >  
> >  	hlist_for_each_entry(nh, head, nh_hash) {
> > @@ -1147,6 +1163,11 @@ int fib_sync_down_dev(struct net_device *dev, int force)
> >  		} endfor_nexthops(fi)
> >  		if (dead == fi->fib_nhs) {
> >  			fi->fib_flags |= RTNH_F_DEAD;
> > +			/* force marks route down due to admin down and device removal. */
> > +			if (!force)
> > +				fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
> > +			else
> > +				fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
> >  			ret++;
> >  		}
> >  	}
> > @@ -1223,10 +1244,12 @@ int fib_sync_up(struct net_device *dev)
> >  	struct hlist_head *head;
> >  	struct fib_nh *nh;
> >  	int ret;
> > +	int link_up;
> >  
> >  	if (!(dev->flags & IFF_UP))
> >  		return 0;
> >  
> > +	link_up = netif_carrier_ok(dev);
> >  	prev_fi = NULL;
> >  	hash = fib_devindex_hashfn(dev->ifindex);
> >  	head = &fib_info_devhash[hash];
> > @@ -1253,16 +1276,27 @@ int fib_sync_up(struct net_device *dev)
> >  			if (nexthop_nh->nh_dev != dev ||
> >  			    !__in_dev_get_rtnl(dev))
> >  				continue;
> > -			alive++;
> > +			if (link_up) {
> > +				/* Link is up, so mark NH as alive */
> > +				nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
> > +				alive++;
> > +			} else
> > +				nexthop_nh->nh_flags |= RTNH_F_DEAD;
> > +
> >  			spin_lock_bh(&fib_multipath_lock);
> >  			nexthop_nh->nh_power = 0;
> > -			nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
> >  			spin_unlock_bh(&fib_multipath_lock);
> >  		} endfor_nexthops(fi)
> >  
> >  		if (alive > 0) {
> > +			/* Some NHs are alive, unmark the route as dead */
> > +			fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
> >  			fi->fib_flags &= ~RTNH_F_DEAD;
> >  			ret++;
> > +		} else {
> > +			/* No NHs are alive, mark the route as dead */
> > +			fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
> > +			fi->fib_flags |= RTNH_F_DEAD;
> >  		}
> >  	}
> >  
> > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> > index 01bce15..eedf287 100644
> > --- a/net/ipv4/fib_trie.c
> > +++ b/net/ipv4/fib_trie.c
> > @@ -1401,12 +1401,18 @@ found:
> >  #endif
> >  			return err;
> >  		}
> > -		if (fi->fib_flags & RTNH_F_DEAD)
> > -			continue;
> > +		if (!(fib_flags & FIB_LOOKUP_ALLOWDEAD)) {
> > +			/* if route is dead and link is down, keep looking  */
> > +			if ((fi->fib_flags & RTNH_F_DEAD) &&
> > +			    (fi->fib_flags & RTNH_F_DEAD_LINKDOWN))
> > +				continue;
> > +		}
> >  		for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
> >  			const struct fib_nh *nh = &fi->fib_nh[nhsel];
> >  
> > -			if (nh->nh_flags & RTNH_F_DEAD)
> > +			/* allow next-hop to be added if link is down */
> > +			if ((nh->nh_flags & RTNH_F_DEAD) &&
> > +			    !(fi->fib_flags & RTNH_F_DEAD_LINKDOWN))
> >  				continue;
> >  			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
> >  				continue;
> > @@ -1829,7 +1835,12 @@ int fib_table_flush(struct fib_table *tb)
> >  		hlist_for_each_entry_safe(fa, tmp, &n->leaf, fa_list) {
> >  			struct fib_info *fi = fa->fa_info;
> >  
> > -			if (!fi || !(fi->fib_flags & RTNH_F_DEAD)) {
> > +			/* DEAD and DEAD_LINKDOWN will not both be set
> > +			 * with IFF_UP is cleared, so do not flush
> > +			 * entries with only DEAD set
> > +			 */
> > +			if (!fi || !((fi->fib_flags & RTNH_F_DEAD) &&
> > +			    !(fi->fib_flags & RTNH_F_DEAD_LINKDOWN))) {
> >  				slen = fa->fa_slen;
> >  				continue;
> >  			}
> 
> Otherwise looks good, thanks!
Thanks for the review.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa June 3, 2015, 2:45 p.m. UTC | #9
On Mi, 2015-06-03 at 10:21 -0400, Neil Horman wrote:
> On Wed, Jun 03, 2015 at 04:13:08PM +0200, Hannes Frederic Sowa wrote:
> > On Mi, 2015-06-03 at 09:53 -0400, Neil Horman wrote:
> > > On Tue, Jun 02, 2015 at 11:07:19PM -0400, Andy Gospodarek wrote:
> > > > This patch adds the ability to have the Linux kernel track whether or
> > > > not a particular route should be used based on the link-status of the
> > > > interface associated with the next-hop.
> > > > 
> > > > Before this patch any link-failure on an interface that was serving as a
> > > > gateway for some systems could result in those systems being isolated
> > > > from the rest of the network as the stack would continue to attempt to
> > > > send frames out of an interface that is actually linked-down.  When the
> > > > kernel is responsible for all forwarding, it should also be responsible
> > > > for taking action when the traffic can no longer be forwarded -- there
> > > > is no real need to outsource link-monitoring to userspace anymore.
> > > > 
> > > > This feature is only enabled with the new sysctl set (default is off):
> > > > net.core.kill_routes_on_linkdown = 1
> > > > 
> > > > When this is set, the following behavior can be observed (interface p8p1
> > > > is link-down):
> > > > 
> > > > # ip route show 
> > > > default via 10.0.5.2 dev p9p1 
> > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
> > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
> > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > # ip route get 90.0.0.1 
> > > > 90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
> > > >     cache 
> > > > # ip route get 80.0.0.1 
> > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > >     cache <local> 
> > > > # ip route get 80.0.0.2
> > > > 80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
> > > >     cache 
> > > > 
> > > > While the route does remain in the table (so it can be modified if
> > > > needed rather than being wiped away as it would be if IFF_UP was
> > > > cleared), the proper next-hop is chosen automatically when the link is
> > > > down.  Now interface p8p1 is linked-up:
> > > > 
> > > > # ip route show 
> > > > default via 10.0.5.2 dev p9p1 
> > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
> > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
> > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
> > > > # ip route get 90.0.0.1 
> > > > 90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > >     cache 
> > > > # ip route get 80.0.0.1 
> > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > >     cache <local> 
> > > > # ip route get 80.0.0.2
> > > > 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > >     cache 
> > > > 
> > > > and the output changes to what one would expect.
> > > > 
> > > > Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> > > > Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> > > > 
> > > > ---
> > > > Though there were some that preferred not to have a configuration option
> > > > and to make this behavior the default when it was discussed in Ottawa
> > > > earlier this year since "it was time to do this."  I wanted to propose
> > > > the config option to preserve the current behavior for those that desire
> > > > it.  I'll happily remove it if Dave and Linus approve.
> > > > 
> > > > An IPv6 implementation is also needed (DECnet too!), but I wanted to
> > > > start with the IPv4 implementation to get people comfortable with the
> > > > idea before moving forward.  If this is accepted the IPv6 implementation
> > > > can be posted shortly.  
> > > > 
> > > > FWIW, we have been running this patch with the sysctl setting above and
> > > > our customers have been happily using a backported version for IPv4 and
> > > > IPv6 for >6 months.
> > > > 
> > > >  include/linux/netdevice.h      |  1 +
> > > >  include/net/fib_rules.h        |  1 +
> > > >  include/net/ip_fib.h           |  1 +
> > > >  include/uapi/linux/rtnetlink.h |  1 +
> > > >  include/uapi/linux/sysctl.h    |  1 +
> > > >  kernel/sysctl_binary.c         |  1 +
> > > >  net/core/dev.c                 |  2 ++
> > > >  net/core/sysctl_net_core.c     |  7 +++++++
> > > >  net/ipv4/fib_frontend.c        | 12 +++++++++--
> > > >  net/ipv4/fib_rules.c           |  7 ++++++-
> > > >  net/ipv4/fib_semantics.c       | 46 ++++++++++++++++++++++++++++++++++++------
> > > >  net/ipv4/fib_trie.c            | 19 +++++++++++++----
> > > >  12 files changed, 86 insertions(+), 13 deletions(-)
> > > > 
> > > 
> > > Hey Andy-
> > > 	So, the implementation looks great.  That said, a question comes up in
> > > my mind that I can't quite resolve.  In looking at your code, and seeing how it
> > > fit into the routing path, I came accross the function sync_fib_down_dev, which
> > > appears to be called in response to NETDEV_DOWN events.  Its purpose from my
> > > read is to interrogate the fib for routes that specify the device going down as
> > > the egress device, and mark said routes as dead.  I'm guessing that I'm missing
> > > something here, but it seems as though this patch shouldn't be needed in light
> > > of that behavior (unless the existing code is somehow broken, or I'm confused
> > > about its purpose).  Could it be that whats really needed here is for Network
> > > Manager (or other user space programs ) to stop removing routes in response to
> > > NETDEV_DOWN events so that the above kernel code can do its job?
> > 
> > We don't care about DEAD flag during normal fib lookup, only when
> > considering multipath routes.
> > 
> 
> I get that, but I think the question still applies.  If you have multiple routes
> to the same destination, the sync_fib_down_dev code should still be doing the
> job that andy's patch does, shouldn't it?  I.e. if you have two interfaces with
> ip addresses on the same subnet, and a route for each interface, only the first
> route in the table typically gets used, because its found first.  But if that
> link goes down, the sync_fib_down_dev code should mark the links route as dead,
> and subsequent lookups should use the alternate interface.  But it seems like
> thats not happening, so I'm wondering why that is.

It is a matter of implementation, I think, albeit I would prefer to not
kill all connections during a small link loss.

Given a laptop with unstable link, we would immediately kill all
connections during packet send with this option enabled as error would
propagate to local_error handlers (I hope this is correct, Andy?).

If you have multiple routes to the same destination as in multipath,
yes, this should work and I think will work.
But as soon as you have to backtrack, this is not a semantically
equivalent route to the destination any more, so we should not switch to
it just because of a link event.

I don't think we can change the default easily here, but I like this new
feature.

Bye,
Hannes


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Gospodarek June 3, 2015, 2:46 p.m. UTC | #10
On Wed, Jun 03, 2015 at 10:21:31AM -0400, Neil Horman wrote:
> On Wed, Jun 03, 2015 at 04:13:08PM +0200, Hannes Frederic Sowa wrote:
> > On Mi, 2015-06-03 at 09:53 -0400, Neil Horman wrote:
> > > On Tue, Jun 02, 2015 at 11:07:19PM -0400, Andy Gospodarek wrote:
> > > > This patch adds the ability to have the Linux kernel track whether or
> > > > not a particular route should be used based on the link-status of the
> > > > interface associated with the next-hop.
> > > > 
> > > > Before this patch any link-failure on an interface that was serving as a
> > > > gateway for some systems could result in those systems being isolated
> > > > from the rest of the network as the stack would continue to attempt to
> > > > send frames out of an interface that is actually linked-down.  When the
> > > > kernel is responsible for all forwarding, it should also be responsible
> > > > for taking action when the traffic can no longer be forwarded -- there
> > > > is no real need to outsource link-monitoring to userspace anymore.
> > > > 
> > > > This feature is only enabled with the new sysctl set (default is off):
> > > > net.core.kill_routes_on_linkdown = 1
> > > > 
> > > > When this is set, the following behavior can be observed (interface p8p1
> > > > is link-down):
> > > > 
> > > > # ip route show 
> > > > default via 10.0.5.2 dev p9p1 
> > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
> > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
> > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > # ip route get 90.0.0.1 
> > > > 90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
> > > >     cache 
> > > > # ip route get 80.0.0.1 
> > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > >     cache <local> 
> > > > # ip route get 80.0.0.2
> > > > 80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
> > > >     cache 
> > > > 
> > > > While the route does remain in the table (so it can be modified if
> > > > needed rather than being wiped away as it would be if IFF_UP was
> > > > cleared), the proper next-hop is chosen automatically when the link is
> > > > down.  Now interface p8p1 is linked-up:
> > > > 
> > > > # ip route show 
> > > > default via 10.0.5.2 dev p9p1 
> > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
> > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
> > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
> > > > # ip route get 90.0.0.1 
> > > > 90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > >     cache 
> > > > # ip route get 80.0.0.1 
> > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > >     cache <local> 
> > > > # ip route get 80.0.0.2
> > > > 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > >     cache 
> > > > 
> > > > and the output changes to what one would expect.
> > > > 
> > > > Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> > > > Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> > > > 
> > > > ---
> > > > Though there were some that preferred not to have a configuration option
> > > > and to make this behavior the default when it was discussed in Ottawa
> > > > earlier this year since "it was time to do this."  I wanted to propose
> > > > the config option to preserve the current behavior for those that desire
> > > > it.  I'll happily remove it if Dave and Linus approve.
> > > > 
> > > > An IPv6 implementation is also needed (DECnet too!), but I wanted to
> > > > start with the IPv4 implementation to get people comfortable with the
> > > > idea before moving forward.  If this is accepted the IPv6 implementation
> > > > can be posted shortly.  
> > > > 
> > > > FWIW, we have been running this patch with the sysctl setting above and
> > > > our customers have been happily using a backported version for IPv4 and
> > > > IPv6 for >6 months.
> > > > 
> > > >  include/linux/netdevice.h      |  1 +
> > > >  include/net/fib_rules.h        |  1 +
> > > >  include/net/ip_fib.h           |  1 +
> > > >  include/uapi/linux/rtnetlink.h |  1 +
> > > >  include/uapi/linux/sysctl.h    |  1 +
> > > >  kernel/sysctl_binary.c         |  1 +
> > > >  net/core/dev.c                 |  2 ++
> > > >  net/core/sysctl_net_core.c     |  7 +++++++
> > > >  net/ipv4/fib_frontend.c        | 12 +++++++++--
> > > >  net/ipv4/fib_rules.c           |  7 ++++++-
> > > >  net/ipv4/fib_semantics.c       | 46 ++++++++++++++++++++++++++++++++++++------
> > > >  net/ipv4/fib_trie.c            | 19 +++++++++++++----
> > > >  12 files changed, 86 insertions(+), 13 deletions(-)
> > > > 
> > > 
> > > Hey Andy-
> > > 	So, the implementation looks great.  That said, a question comes up in
> > > my mind that I can't quite resolve.  In looking at your code, and seeing how it
> > > fit into the routing path, I came accross the function sync_fib_down_dev, which
> > > appears to be called in response to NETDEV_DOWN events.  Its purpose from my
> > > read is to interrogate the fib for routes that specify the device going down as
> > > the egress device, and mark said routes as dead.  I'm guessing that I'm missing
> > > something here, but it seems as though this patch shouldn't be needed in light
> > > of that behavior (unless the existing code is somehow broken, or I'm confused
> > > about its purpose).  Could it be that whats really needed here is for Network
> > > Manager (or other user space programs ) to stop removing routes in response to
> > > NETDEV_DOWN events so that the above kernel code can do its job?
> > 
> > We don't care about DEAD flag during normal fib lookup, only when
> > considering multipath routes.
> > 
> 
> I get that, but I think the question still applies.  If you have multiple routes
> to the same destination, the sync_fib_down_dev code should still be doing the
> job that andy's patch does, shouldn't it?  I.e. if you have two interfaces with
> ip addresses on the same subnet, and a route for each interface, only the first
> route in the table typically gets used, because its found first.  But if that
> link goes down, the sync_fib_down_dev code should mark the links route as dead,
> and subsequent lookups should use the alternate interface.  But it seems like
> thats not happening, so I'm wondering why that is.

Neil, I think I see what you are asking, let me see if I can clarify
this.  The confusing part here is that NETDEV_DOWN events are done when
IFF_UP is cleared, not when link changes (those are NETDEV_CHANGE
events), but sync_fib_down_dev() is called by both with this patch.

When NETDEV_DOWN events happen the DEAD_LINKDOWN flag needs to be
cleared so that proper cleanup happens later when flushing routes later.
This is why some action is needed in sync_fib_down_dev to handle this
new flag.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman June 3, 2015, 3:02 p.m. UTC | #11
On Wed, Jun 03, 2015 at 10:46:22AM -0400, Andy Gospodarek wrote:
> On Wed, Jun 03, 2015 at 10:21:31AM -0400, Neil Horman wrote:
> > On Wed, Jun 03, 2015 at 04:13:08PM +0200, Hannes Frederic Sowa wrote:
> > > On Mi, 2015-06-03 at 09:53 -0400, Neil Horman wrote:
> > > > On Tue, Jun 02, 2015 at 11:07:19PM -0400, Andy Gospodarek wrote:
> > > > > This patch adds the ability to have the Linux kernel track whether or
> > > > > not a particular route should be used based on the link-status of the
> > > > > interface associated with the next-hop.
> > > > > 
> > > > > Before this patch any link-failure on an interface that was serving as a
> > > > > gateway for some systems could result in those systems being isolated
> > > > > from the rest of the network as the stack would continue to attempt to
> > > > > send frames out of an interface that is actually linked-down.  When the
> > > > > kernel is responsible for all forwarding, it should also be responsible
> > > > > for taking action when the traffic can no longer be forwarded -- there
> > > > > is no real need to outsource link-monitoring to userspace anymore.
> > > > > 
> > > > > This feature is only enabled with the new sysctl set (default is off):
> > > > > net.core.kill_routes_on_linkdown = 1
> > > > > 
> > > > > When this is set, the following behavior can be observed (interface p8p1
> > > > > is link-down):
> > > > > 
> > > > > # ip route show 
> > > > > default via 10.0.5.2 dev p9p1 
> > > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
> > > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
> > > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > > # ip route get 90.0.0.1 
> > > > > 90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
> > > > >     cache 
> > > > > # ip route get 80.0.0.1 
> > > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > > >     cache <local> 
> > > > > # ip route get 80.0.0.2
> > > > > 80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
> > > > >     cache 
> > > > > 
> > > > > While the route does remain in the table (so it can be modified if
> > > > > needed rather than being wiped away as it would be if IFF_UP was
> > > > > cleared), the proper next-hop is chosen automatically when the link is
> > > > > down.  Now interface p8p1 is linked-up:
> > > > > 
> > > > > # ip route show 
> > > > > default via 10.0.5.2 dev p9p1 
> > > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
> > > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
> > > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > > 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
> > > > > # ip route get 90.0.0.1 
> > > > > 90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > > >     cache 
> > > > > # ip route get 80.0.0.1 
> > > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > > >     cache <local> 
> > > > > # ip route get 80.0.0.2
> > > > > 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > > >     cache 
> > > > > 
> > > > > and the output changes to what one would expect.
> > > > > 
> > > > > Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> > > > > Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> > > > > 
> > > > > ---
> > > > > Though there were some that preferred not to have a configuration option
> > > > > and to make this behavior the default when it was discussed in Ottawa
> > > > > earlier this year since "it was time to do this."  I wanted to propose
> > > > > the config option to preserve the current behavior for those that desire
> > > > > it.  I'll happily remove it if Dave and Linus approve.
> > > > > 
> > > > > An IPv6 implementation is also needed (DECnet too!), but I wanted to
> > > > > start with the IPv4 implementation to get people comfortable with the
> > > > > idea before moving forward.  If this is accepted the IPv6 implementation
> > > > > can be posted shortly.  
> > > > > 
> > > > > FWIW, we have been running this patch with the sysctl setting above and
> > > > > our customers have been happily using a backported version for IPv4 and
> > > > > IPv6 for >6 months.
> > > > > 
> > > > >  include/linux/netdevice.h      |  1 +
> > > > >  include/net/fib_rules.h        |  1 +
> > > > >  include/net/ip_fib.h           |  1 +
> > > > >  include/uapi/linux/rtnetlink.h |  1 +
> > > > >  include/uapi/linux/sysctl.h    |  1 +
> > > > >  kernel/sysctl_binary.c         |  1 +
> > > > >  net/core/dev.c                 |  2 ++
> > > > >  net/core/sysctl_net_core.c     |  7 +++++++
> > > > >  net/ipv4/fib_frontend.c        | 12 +++++++++--
> > > > >  net/ipv4/fib_rules.c           |  7 ++++++-
> > > > >  net/ipv4/fib_semantics.c       | 46 ++++++++++++++++++++++++++++++++++++------
> > > > >  net/ipv4/fib_trie.c            | 19 +++++++++++++----
> > > > >  12 files changed, 86 insertions(+), 13 deletions(-)
> > > > > 
> > > > 
> > > > Hey Andy-
> > > > 	So, the implementation looks great.  That said, a question comes up in
> > > > my mind that I can't quite resolve.  In looking at your code, and seeing how it
> > > > fit into the routing path, I came accross the function sync_fib_down_dev, which
> > > > appears to be called in response to NETDEV_DOWN events.  Its purpose from my
> > > > read is to interrogate the fib for routes that specify the device going down as
> > > > the egress device, and mark said routes as dead.  I'm guessing that I'm missing
> > > > something here, but it seems as though this patch shouldn't be needed in light
> > > > of that behavior (unless the existing code is somehow broken, or I'm confused
> > > > about its purpose).  Could it be that whats really needed here is for Network
> > > > Manager (or other user space programs ) to stop removing routes in response to
> > > > NETDEV_DOWN events so that the above kernel code can do its job?
> > > 
> > > We don't care about DEAD flag during normal fib lookup, only when
> > > considering multipath routes.
> > > 
> > 
> > I get that, but I think the question still applies.  If you have multiple routes
> > to the same destination, the sync_fib_down_dev code should still be doing the
> > job that andy's patch does, shouldn't it?  I.e. if you have two interfaces with
> > ip addresses on the same subnet, and a route for each interface, only the first
> > route in the table typically gets used, because its found first.  But if that
> > link goes down, the sync_fib_down_dev code should mark the links route as dead,
> > and subsequent lookups should use the alternate interface.  But it seems like
> > thats not happening, so I'm wondering why that is.
> 
> Neil, I think I see what you are asking, let me see if I can clarify
> this.  The confusing part here is that NETDEV_DOWN events are done when
> IFF_UP is cleared, not when link changes (those are NETDEV_CHANGE
> events), but sync_fib_down_dev() is called by both with this patch.
> 
Ok, I can see that, but shouldn't we get IFF_UP cleared when we loose link
anyway? I.e. if we loose link, we loose carrier, and as a result we get a
linkwatch event that takes down the interface, no?

> When NETDEV_DOWN events happen the DEAD_LINKDOWN flag needs to be
> cleared so that proper cleanup happens later when flushing routes later.
> This is why some action is needed in sync_fib_down_dev to handle this
> new flag.
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Gospodarek June 3, 2015, 3:12 p.m. UTC | #12
On Tue, Jun 02, 2015 at 10:03:23PM -0700, Scott Feldman wrote:
> On Tue, Jun 2, 2015 at 8:07 PM, Andy Gospodarek
> <gospo@cumulusnetworks.com> wrote:
> > This patch adds the ability to have the Linux kernel track whether or
> > not a particular route should be used based on the link-status of the
> > interface associated with the next-hop.
> >
> > Before this patch any link-failure on an interface that was serving as a
> > gateway for some systems could result in those systems being isolated
> > from the rest of the network as the stack would continue to attempt to
> > send frames out of an interface that is actually linked-down.  When the
> > kernel is responsible for all forwarding, it should also be responsible
> > for taking action when the traffic can no longer be forwarded -- there
> > is no real need to outsource link-monitoring to userspace anymore.
> 
> Hi Andy, how does this work for the hardware offload case?  I'm not
> seeing how hardware gets updated when the software route is updated.
> Seems hardware isn't updated and would send frames out the downed nh
> interface.

Scott, you are correct that this does not currently address the hardware
offload case.  If one wanted offloaded hardware to reflect this change,
then there needs to be either:

- The ability to communicate the flag changes down to offload devices in
  the event that they want some action to take place
  switchdev_fib_ipv4_modify(), maybe? (Not to derail this convo too much
  but it will soon be time to consider what to do at the switchdev layer
  when route flags are modified like when the RTNH_F_OFFLOAD flag is
  removed by the user if they feel the route does not need to be
  offloaded for any reason.)

- Register a register_netdevice_notifier and take action at the
  switchdev layer to make sure that routes can be removed or marked as
  inactive when link goes down.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Gospodarek June 3, 2015, 3:16 p.m. UTC | #13
On Wed, Jun 03, 2015 at 11:02:23AM -0400, Neil Horman wrote:
> On Wed, Jun 03, 2015 at 10:46:22AM -0400, Andy Gospodarek wrote:
> > On Wed, Jun 03, 2015 at 10:21:31AM -0400, Neil Horman wrote:
> > > On Wed, Jun 03, 2015 at 04:13:08PM +0200, Hannes Frederic Sowa wrote:
> > > > On Mi, 2015-06-03 at 09:53 -0400, Neil Horman wrote:
> > > > > On Tue, Jun 02, 2015 at 11:07:19PM -0400, Andy Gospodarek wrote:
> > > > > > This patch adds the ability to have the Linux kernel track whether or
> > > > > > not a particular route should be used based on the link-status of the
> > > > > > interface associated with the next-hop.
> > > > > > 
> > > > > > Before this patch any link-failure on an interface that was serving as a
> > > > > > gateway for some systems could result in those systems being isolated
> > > > > > from the rest of the network as the stack would continue to attempt to
> > > > > > send frames out of an interface that is actually linked-down.  When the
> > > > > > kernel is responsible for all forwarding, it should also be responsible
> > > > > > for taking action when the traffic can no longer be forwarded -- there
> > > > > > is no real need to outsource link-monitoring to userspace anymore.
> > > > > > 
> > > > > > This feature is only enabled with the new sysctl set (default is off):
> > > > > > net.core.kill_routes_on_linkdown = 1
> > > > > > 
> > > > > > When this is set, the following behavior can be observed (interface p8p1
> > > > > > is link-down):
> > > > > > 
> > > > > > # ip route show 
> > > > > > default via 10.0.5.2 dev p9p1 
> > > > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
> > > > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
> > > > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > > > # ip route get 90.0.0.1 
> > > > > > 90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
> > > > > >     cache 
> > > > > > # ip route get 80.0.0.1 
> > > > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > > > >     cache <local> 
> > > > > > # ip route get 80.0.0.2
> > > > > > 80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
> > > > > >     cache 
> > > > > > 
> > > > > > While the route does remain in the table (so it can be modified if
> > > > > > needed rather than being wiped away as it would be if IFF_UP was
> > > > > > cleared), the proper next-hop is chosen automatically when the link is
> > > > > > down.  Now interface p8p1 is linked-up:
> > > > > > 
> > > > > > # ip route show 
> > > > > > default via 10.0.5.2 dev p9p1 
> > > > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
> > > > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
> > > > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > > > 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
> > > > > > # ip route get 90.0.0.1 
> > > > > > 90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > > > >     cache 
> > > > > > # ip route get 80.0.0.1 
> > > > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > > > >     cache <local> 
> > > > > > # ip route get 80.0.0.2
> > > > > > 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > > > >     cache 
> > > > > > 
> > > > > > and the output changes to what one would expect.
> > > > > > 
> > > > > > Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> > > > > > Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> > > > > > 
> > > > > > ---
> > > > > > Though there were some that preferred not to have a configuration option
> > > > > > and to make this behavior the default when it was discussed in Ottawa
> > > > > > earlier this year since "it was time to do this."  I wanted to propose
> > > > > > the config option to preserve the current behavior for those that desire
> > > > > > it.  I'll happily remove it if Dave and Linus approve.
> > > > > > 
> > > > > > An IPv6 implementation is also needed (DECnet too!), but I wanted to
> > > > > > start with the IPv4 implementation to get people comfortable with the
> > > > > > idea before moving forward.  If this is accepted the IPv6 implementation
> > > > > > can be posted shortly.  
> > > > > > 
> > > > > > FWIW, we have been running this patch with the sysctl setting above and
> > > > > > our customers have been happily using a backported version for IPv4 and
> > > > > > IPv6 for >6 months.
> > > > > > 
> > > > > >  include/linux/netdevice.h      |  1 +
> > > > > >  include/net/fib_rules.h        |  1 +
> > > > > >  include/net/ip_fib.h           |  1 +
> > > > > >  include/uapi/linux/rtnetlink.h |  1 +
> > > > > >  include/uapi/linux/sysctl.h    |  1 +
> > > > > >  kernel/sysctl_binary.c         |  1 +
> > > > > >  net/core/dev.c                 |  2 ++
> > > > > >  net/core/sysctl_net_core.c     |  7 +++++++
> > > > > >  net/ipv4/fib_frontend.c        | 12 +++++++++--
> > > > > >  net/ipv4/fib_rules.c           |  7 ++++++-
> > > > > >  net/ipv4/fib_semantics.c       | 46 ++++++++++++++++++++++++++++++++++++------
> > > > > >  net/ipv4/fib_trie.c            | 19 +++++++++++++----
> > > > > >  12 files changed, 86 insertions(+), 13 deletions(-)
> > > > > > 
> > > > > 
> > > > > Hey Andy-
> > > > > 	So, the implementation looks great.  That said, a question comes up in
> > > > > my mind that I can't quite resolve.  In looking at your code, and seeing how it
> > > > > fit into the routing path, I came accross the function sync_fib_down_dev, which
> > > > > appears to be called in response to NETDEV_DOWN events.  Its purpose from my
> > > > > read is to interrogate the fib for routes that specify the device going down as
> > > > > the egress device, and mark said routes as dead.  I'm guessing that I'm missing
> > > > > something here, but it seems as though this patch shouldn't be needed in light
> > > > > of that behavior (unless the existing code is somehow broken, or I'm confused
> > > > > about its purpose).  Could it be that whats really needed here is for Network
> > > > > Manager (or other user space programs ) to stop removing routes in response to
> > > > > NETDEV_DOWN events so that the above kernel code can do its job?
> > > > 
> > > > We don't care about DEAD flag during normal fib lookup, only when
> > > > considering multipath routes.
> > > > 
> > > 
> > > I get that, but I think the question still applies.  If you have multiple routes
> > > to the same destination, the sync_fib_down_dev code should still be doing the
> > > job that andy's patch does, shouldn't it?  I.e. if you have two interfaces with
> > > ip addresses on the same subnet, and a route for each interface, only the first
> > > route in the table typically gets used, because its found first.  But if that
> > > link goes down, the sync_fib_down_dev code should mark the links route as dead,
> > > and subsequent lookups should use the alternate interface.  But it seems like
> > > thats not happening, so I'm wondering why that is.
> > 
> > Neil, I think I see what you are asking, let me see if I can clarify
> > this.  The confusing part here is that NETDEV_DOWN events are done when
> > IFF_UP is cleared, not when link changes (those are NETDEV_CHANGE
> > events), but sync_fib_down_dev() is called by both with this patch.
> > 
> Ok, I can see that, but shouldn't we get IFF_UP cleared when we loose link
> anyway? I.e. if we loose link, we loose carrier, and as a result we get a
> linkwatch event that takes down the interface, no?

operstate is cleared by linkwatch (so 'ip link' output shows
'NO-CARRIER' vs 'LOWER_UP'), but IFF_UP is not.  

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa June 3, 2015, 3:17 p.m. UTC | #14
On Mi, 2015-06-03 at 11:02 -0400, Neil Horman wrote:
> On Wed, Jun 03, 2015 at 10:46:22AM -0400, Andy Gospodarek wrote:
> > On Wed, Jun 03, 2015 at 10:21:31AM -0400, Neil Horman wrote:
> > > On Wed, Jun 03, 2015 at 04:13:08PM +0200, Hannes Frederic Sowa wrote:
> > > > On Mi, 2015-06-03 at 09:53 -0400, Neil Horman wrote:
> > > > > On Tue, Jun 02, 2015 at 11:07:19PM -0400, Andy Gospodarek wrote:
> > > > > > This patch adds the ability to have the Linux kernel track whether or
> > > > > > not a particular route should be used based on the link-status of the
> > > > > > interface associated with the next-hop.
> > > > > > 
> > > > > > Before this patch any link-failure on an interface that was serving as a
> > > > > > gateway for some systems could result in those systems being isolated
> > > > > > from the rest of the network as the stack would continue to attempt to
> > > > > > send frames out of an interface that is actually linked-down.  When the
> > > > > > kernel is responsible for all forwarding, it should also be responsible
> > > > > > for taking action when the traffic can no longer be forwarded -- there
> > > > > > is no real need to outsource link-monitoring to userspace anymore.
> > > > > > 
> > > > > > This feature is only enabled with the new sysctl set (default is off):
> > > > > > net.core.kill_routes_on_linkdown = 1
> > > > > > 
> > > > > > When this is set, the following behavior can be observed (interface p8p1
> > > > > > is link-down):
> > > > > > 
> > > > > > # ip route show 
> > > > > > default via 10.0.5.2 dev p9p1 
> > > > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
> > > > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
> > > > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > > > # ip route get 90.0.0.1 
> > > > > > 90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
> > > > > >     cache 
> > > > > > # ip route get 80.0.0.1 
> > > > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > > > >     cache <local> 
> > > > > > # ip route get 80.0.0.2
> > > > > > 80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
> > > > > >     cache 
> > > > > > 
> > > > > > While the route does remain in the table (so it can be modified if
> > > > > > needed rather than being wiped away as it would be if IFF_UP was
> > > > > > cleared), the proper next-hop is chosen automatically when the link is
> > > > > > down.  Now interface p8p1 is linked-up:
> > > > > > 
> > > > > > # ip route show 
> > > > > > default via 10.0.5.2 dev p9p1 
> > > > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
> > > > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
> > > > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > > > 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
> > > > > > # ip route get 90.0.0.1 
> > > > > > 90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > > > >     cache 
> > > > > > # ip route get 80.0.0.1 
> > > > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > > > >     cache <local> 
> > > > > > # ip route get 80.0.0.2
> > > > > > 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > > > >     cache 
> > > > > > 
> > > > > > and the output changes to what one would expect.
> > > > > > 
> > > > > > Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> > > > > > Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> > > > > > 
> > > > > > ---
> > > > > > Though there were some that preferred not to have a configuration option
> > > > > > and to make this behavior the default when it was discussed in Ottawa
> > > > > > earlier this year since "it was time to do this."  I wanted to propose
> > > > > > the config option to preserve the current behavior for those that desire
> > > > > > it.  I'll happily remove it if Dave and Linus approve.
> > > > > > 
> > > > > > An IPv6 implementation is also needed (DECnet too!), but I wanted to
> > > > > > start with the IPv4 implementation to get people comfortable with the
> > > > > > idea before moving forward.  If this is accepted the IPv6 implementation
> > > > > > can be posted shortly.  
> > > > > > 
> > > > > > FWIW, we have been running this patch with the sysctl setting above and
> > > > > > our customers have been happily using a backported version for IPv4 and
> > > > > > IPv6 for >6 months.
> > > > > > 
> > > > > >  include/linux/netdevice.h      |  1 +
> > > > > >  include/net/fib_rules.h        |  1 +
> > > > > >  include/net/ip_fib.h           |  1 +
> > > > > >  include/uapi/linux/rtnetlink.h |  1 +
> > > > > >  include/uapi/linux/sysctl.h    |  1 +
> > > > > >  kernel/sysctl_binary.c         |  1 +
> > > > > >  net/core/dev.c                 |  2 ++
> > > > > >  net/core/sysctl_net_core.c     |  7 +++++++
> > > > > >  net/ipv4/fib_frontend.c        | 12 +++++++++--
> > > > > >  net/ipv4/fib_rules.c           |  7 ++++++-
> > > > > >  net/ipv4/fib_semantics.c       | 46 ++++++++++++++++++++++++++++++++++++------
> > > > > >  net/ipv4/fib_trie.c            | 19 +++++++++++++----
> > > > > >  12 files changed, 86 insertions(+), 13 deletions(-)
> > > > > > 
> > > > > 
> > > > > Hey Andy-
> > > > > 	So, the implementation looks great.  That said, a question comes up in
> > > > > my mind that I can't quite resolve.  In looking at your code, and seeing how it
> > > > > fit into the routing path, I came accross the function sync_fib_down_dev, which
> > > > > appears to be called in response to NETDEV_DOWN events.  Its purpose from my
> > > > > read is to interrogate the fib for routes that specify the device going down as
> > > > > the egress device, and mark said routes as dead.  I'm guessing that I'm missing
> > > > > something here, but it seems as though this patch shouldn't be needed in light
> > > > > of that behavior (unless the existing code is somehow broken, or I'm confused
> > > > > about its purpose).  Could it be that whats really needed here is for Network
> > > > > Manager (or other user space programs ) to stop removing routes in response to
> > > > > NETDEV_DOWN events so that the above kernel code can do its job?
> > > > 
> > > > We don't care about DEAD flag during normal fib lookup, only when
> > > > considering multipath routes.
> > > > 
> > > 
> > > I get that, but I think the question still applies.  If you have multiple routes
> > > to the same destination, the sync_fib_down_dev code should still be doing the
> > > job that andy's patch does, shouldn't it?  I.e. if you have two interfaces with
> > > ip addresses on the same subnet, and a route for each interface, only the first
> > > route in the table typically gets used, because its found first.  But if that
> > > link goes down, the sync_fib_down_dev code should mark the links route as dead,
> > > and subsequent lookups should use the alternate interface.  But it seems like
> > > thats not happening, so I'm wondering why that is.
> > 
> > Neil, I think I see what you are asking, let me see if I can clarify
> > this.  The confusing part here is that NETDEV_DOWN events are done when
> > IFF_UP is cleared, not when link changes (those are NETDEV_CHANGE
> > events), but sync_fib_down_dev() is called by both with this patch.
> > 
> Ok, I can see that, but shouldn't we get IFF_UP cleared when we loose link
> anyway? I.e. if we loose link, we loose carrier, and as a result we get a
> linkwatch event that takes down the interface, no?

I consider IFF_UP and carrier related events orthogonal.

Your proposal can be implemented by user space policy, of course, or
routing protocols/daemons (which act on behalf a policy), but I don't
want the kernel to decide that by default. If the kernel would have done
so all the time, fine, we would have experience with that. But changing
this behavior now seems to be problematic to me.

Imaging unplugging a switch and suddenly your router sending out
internal network traffic to the default route...

Bye,
Hannes


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Scott Feldman June 3, 2015, 5:40 p.m. UTC | #15
On Wed, Jun 3, 2015 at 8:12 AM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:
> On Tue, Jun 02, 2015 at 10:03:23PM -0700, Scott Feldman wrote:
>> On Tue, Jun 2, 2015 at 8:07 PM, Andy Gospodarek
>> <gospo@cumulusnetworks.com> wrote:
>> > This patch adds the ability to have the Linux kernel track whether or
>> > not a particular route should be used based on the link-status of the
>> > interface associated with the next-hop.
>> >
>> > Before this patch any link-failure on an interface that was serving as a
>> > gateway for some systems could result in those systems being isolated
>> > from the rest of the network as the stack would continue to attempt to
>> > send frames out of an interface that is actually linked-down.  When the
>> > kernel is responsible for all forwarding, it should also be responsible
>> > for taking action when the traffic can no longer be forwarded -- there
>> > is no real need to outsource link-monitoring to userspace anymore.
>>
>> Hi Andy, how does this work for the hardware offload case?  I'm not
>> seeing how hardware gets updated when the software route is updated.
>> Seems hardware isn't updated and would send frames out the downed nh
>> interface.
>
> Scott, you are correct that this does not currently address the hardware
> offload case.  If one wanted offloaded hardware to reflect this change,
> then there needs to be either:
>
> - The ability to communicate the flag changes down to offload devices in
>   the event that they want some action to take place
>   switchdev_fib_ipv4_modify(), maybe? (Not to derail this convo too much
>   but it will soon be time to consider what to do at the switchdev layer
>   when route flags are modified like when the RTNH_F_OFFLOAD flag is
>   removed by the user if they feel the route does not need to be
>   offloaded for any reason.)

Switchdev_fib_ipv4_add() can handle adds and mods, so you can use
that.  Any place you're sending notification of route change, hook
switchdev so the offload device is updated with the route change.

Please include updates for switchdev for this patch v2.  I'm assuming
you're already shipping this feature on offloaded switches, so you
want it to work with switchdev, right?


> - Register a register_netdevice_notifier and take action at the
>   switchdev layer to make sure that routes can be removed or marked as
>   inactive when link goes down.

The core code is already handling the events, so it seems better to
call into switchdev from core code rather than having two listeners
potentially handling the event differently.

-scott
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman June 3, 2015, 5:47 p.m. UTC | #16
On Wed, Jun 03, 2015 at 11:16:27AM -0400, Andy Gospodarek wrote:
> On Wed, Jun 03, 2015 at 11:02:23AM -0400, Neil Horman wrote:
> > On Wed, Jun 03, 2015 at 10:46:22AM -0400, Andy Gospodarek wrote:
> > > On Wed, Jun 03, 2015 at 10:21:31AM -0400, Neil Horman wrote:
> > > > On Wed, Jun 03, 2015 at 04:13:08PM +0200, Hannes Frederic Sowa wrote:
> > > > > On Mi, 2015-06-03 at 09:53 -0400, Neil Horman wrote:
> > > > > > On Tue, Jun 02, 2015 at 11:07:19PM -0400, Andy Gospodarek wrote:
> > > > > > > This patch adds the ability to have the Linux kernel track whether or
> > > > > > > not a particular route should be used based on the link-status of the
> > > > > > > interface associated with the next-hop.
> > > > > > > 
> > > > > > > Before this patch any link-failure on an interface that was serving as a
> > > > > > > gateway for some systems could result in those systems being isolated
> > > > > > > from the rest of the network as the stack would continue to attempt to
> > > > > > > send frames out of an interface that is actually linked-down.  When the
> > > > > > > kernel is responsible for all forwarding, it should also be responsible
> > > > > > > for taking action when the traffic can no longer be forwarded -- there
> > > > > > > is no real need to outsource link-monitoring to userspace anymore.
> > > > > > > 
> > > > > > > This feature is only enabled with the new sysctl set (default is off):
> > > > > > > net.core.kill_routes_on_linkdown = 1
> > > > > > > 
> > > > > > > When this is set, the following behavior can be observed (interface p8p1
> > > > > > > is link-down):
> > > > > > > 
> > > > > > > # ip route show 
> > > > > > > default via 10.0.5.2 dev p9p1 
> > > > > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
> > > > > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
> > > > > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > > > > # ip route get 90.0.0.1 
> > > > > > > 90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
> > > > > > >     cache 
> > > > > > > # ip route get 80.0.0.1 
> > > > > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > > > > >     cache <local> 
> > > > > > > # ip route get 80.0.0.2
> > > > > > > 80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
> > > > > > >     cache 
> > > > > > > 
> > > > > > > While the route does remain in the table (so it can be modified if
> > > > > > > needed rather than being wiped away as it would be if IFF_UP was
> > > > > > > cleared), the proper next-hop is chosen automatically when the link is
> > > > > > > down.  Now interface p8p1 is linked-up:
> > > > > > > 
> > > > > > > # ip route show 
> > > > > > > default via 10.0.5.2 dev p9p1 
> > > > > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
> > > > > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
> > > > > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > > > > 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
> > > > > > > # ip route get 90.0.0.1 
> > > > > > > 90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > > > > >     cache 
> > > > > > > # ip route get 80.0.0.1 
> > > > > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > > > > >     cache <local> 
> > > > > > > # ip route get 80.0.0.2
> > > > > > > 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > > > > >     cache 
> > > > > > > 
> > > > > > > and the output changes to what one would expect.
> > > > > > > 
> > > > > > > Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> > > > > > > Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> > > > > > > 
> > > > > > > ---
> > > > > > > Though there were some that preferred not to have a configuration option
> > > > > > > and to make this behavior the default when it was discussed in Ottawa
> > > > > > > earlier this year since "it was time to do this."  I wanted to propose
> > > > > > > the config option to preserve the current behavior for those that desire
> > > > > > > it.  I'll happily remove it if Dave and Linus approve.
> > > > > > > 
> > > > > > > An IPv6 implementation is also needed (DECnet too!), but I wanted to
> > > > > > > start with the IPv4 implementation to get people comfortable with the
> > > > > > > idea before moving forward.  If this is accepted the IPv6 implementation
> > > > > > > can be posted shortly.  
> > > > > > > 
> > > > > > > FWIW, we have been running this patch with the sysctl setting above and
> > > > > > > our customers have been happily using a backported version for IPv4 and
> > > > > > > IPv6 for >6 months.
> > > > > > > 
> > > > > > >  include/linux/netdevice.h      |  1 +
> > > > > > >  include/net/fib_rules.h        |  1 +
> > > > > > >  include/net/ip_fib.h           |  1 +
> > > > > > >  include/uapi/linux/rtnetlink.h |  1 +
> > > > > > >  include/uapi/linux/sysctl.h    |  1 +
> > > > > > >  kernel/sysctl_binary.c         |  1 +
> > > > > > >  net/core/dev.c                 |  2 ++
> > > > > > >  net/core/sysctl_net_core.c     |  7 +++++++
> > > > > > >  net/ipv4/fib_frontend.c        | 12 +++++++++--
> > > > > > >  net/ipv4/fib_rules.c           |  7 ++++++-
> > > > > > >  net/ipv4/fib_semantics.c       | 46 ++++++++++++++++++++++++++++++++++++------
> > > > > > >  net/ipv4/fib_trie.c            | 19 +++++++++++++----
> > > > > > >  12 files changed, 86 insertions(+), 13 deletions(-)
> > > > > > > 
> > > > > > 
> > > > > > Hey Andy-
> > > > > > 	So, the implementation looks great.  That said, a question comes up in
> > > > > > my mind that I can't quite resolve.  In looking at your code, and seeing how it
> > > > > > fit into the routing path, I came accross the function sync_fib_down_dev, which
> > > > > > appears to be called in response to NETDEV_DOWN events.  Its purpose from my
> > > > > > read is to interrogate the fib for routes that specify the device going down as
> > > > > > the egress device, and mark said routes as dead.  I'm guessing that I'm missing
> > > > > > something here, but it seems as though this patch shouldn't be needed in light
> > > > > > of that behavior (unless the existing code is somehow broken, or I'm confused
> > > > > > about its purpose).  Could it be that whats really needed here is for Network
> > > > > > Manager (or other user space programs ) to stop removing routes in response to
> > > > > > NETDEV_DOWN events so that the above kernel code can do its job?
> > > > > 
> > > > > We don't care about DEAD flag during normal fib lookup, only when
> > > > > considering multipath routes.
> > > > > 
> > > > 
> > > > I get that, but I think the question still applies.  If you have multiple routes
> > > > to the same destination, the sync_fib_down_dev code should still be doing the
> > > > job that andy's patch does, shouldn't it?  I.e. if you have two interfaces with
> > > > ip addresses on the same subnet, and a route for each interface, only the first
> > > > route in the table typically gets used, because its found first.  But if that
> > > > link goes down, the sync_fib_down_dev code should mark the links route as dead,
> > > > and subsequent lookups should use the alternate interface.  But it seems like
> > > > thats not happening, so I'm wondering why that is.
> > > 
> > > Neil, I think I see what you are asking, let me see if I can clarify
> > > this.  The confusing part here is that NETDEV_DOWN events are done when
> > > IFF_UP is cleared, not when link changes (those are NETDEV_CHANGE
> > > events), but sync_fib_down_dev() is called by both with this patch.
> > > 
> > Ok, I can see that, but shouldn't we get IFF_UP cleared when we loose link
> > anyway? I.e. if we loose link, we loose carrier, and as a result we get a
> > linkwatch event that takes down the interface, no?
> 
> operstate is cleared by linkwatch (so 'ip link' output shows
> 'NO-CARRIER' vs 'LOWER_UP'), but IFF_UP is not.  
> 
Ah, thank you Andy, thats the bit I was missing, I thought dev_deactive, as
called from linkwatch_run_queue cleared IFF_UP, but it doesn't.

Thanks for the clarification!
Neil

> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Gospodarek June 3, 2015, 5:57 p.m. UTC | #17
On Wed, Jun 03, 2015 at 10:40:26AM -0700, Scott Feldman wrote:
> On Wed, Jun 3, 2015 at 8:12 AM, Andy Gospodarek
> <gospo@cumulusnetworks.com> wrote:
> > On Tue, Jun 02, 2015 at 10:03:23PM -0700, Scott Feldman wrote:
> >> On Tue, Jun 2, 2015 at 8:07 PM, Andy Gospodarek
> >> <gospo@cumulusnetworks.com> wrote:
> >> > This patch adds the ability to have the Linux kernel track whether or
> >> > not a particular route should be used based on the link-status of the
> >> > interface associated with the next-hop.
> >> >
> >> > Before this patch any link-failure on an interface that was serving as a
> >> > gateway for some systems could result in those systems being isolated
> >> > from the rest of the network as the stack would continue to attempt to
> >> > send frames out of an interface that is actually linked-down.  When the
> >> > kernel is responsible for all forwarding, it should also be responsible
> >> > for taking action when the traffic can no longer be forwarded -- there
> >> > is no real need to outsource link-monitoring to userspace anymore.
> >>
> >> Hi Andy, how does this work for the hardware offload case?  I'm not
> >> seeing how hardware gets updated when the software route is updated.
> >> Seems hardware isn't updated and would send frames out the downed nh
> >> interface.
> >
> > Scott, you are correct that this does not currently address the hardware
> > offload case.  If one wanted offloaded hardware to reflect this change,
> > then there needs to be either:
> >
> > - The ability to communicate the flag changes down to offload devices in
> >   the event that they want some action to take place
> >   switchdev_fib_ipv4_modify(), maybe? (Not to derail this convo too much
> >   but it will soon be time to consider what to do at the switchdev layer
> >   when route flags are modified like when the RTNH_F_OFFLOAD flag is
> >   removed by the user if they feel the route does not need to be
> >   offloaded for any reason.)
> 
> Switchdev_fib_ipv4_add() can handle adds and mods, so you can use
> that.  Any place you're sending notification of route change, hook
> switchdev so the offload device is updated with the route change.
I should know better than to ask this, but does just the rocker
implementation presume this operation is an add and modify or was this
the implementation from the beginning.  (Forgive me for not remembering
earlier discussion on this.)

> 
> Please include updates for switchdev for this patch v2.
I knew there was a reason why I should have submitted these patches
earlier.  ;-)

> I'm assuming you're already shipping this feature on offloaded
> switches, so you want it to work with switchdev, right?
Of course I think there is value there and I'm happy to do it.  This is
quite useful in the case of a forwarding element, so having support
switchdev is something I would like to see.

> 
> 
> > - Register a register_netdevice_notifier and take action at the
> >   switchdev layer to make sure that routes can be removed or marked as
> >   inactive when link goes down.
> 
> The core code is already handling the events, so it seems better to
> call into switchdev from core code rather than having two listeners
> potentially handling the event differently.
> 
> -scott
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Scott Feldman June 3, 2015, 6:12 p.m. UTC | #18
On Wed, Jun 3, 2015 at 10:57 AM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:
> On Wed, Jun 03, 2015 at 10:40:26AM -0700, Scott Feldman wrote:
>> On Wed, Jun 3, 2015 at 8:12 AM, Andy Gospodarek
>> <gospo@cumulusnetworks.com> wrote:
>> > On Tue, Jun 02, 2015 at 10:03:23PM -0700, Scott Feldman wrote:
>> >> On Tue, Jun 2, 2015 at 8:07 PM, Andy Gospodarek
>> >> <gospo@cumulusnetworks.com> wrote:
>> >> > This patch adds the ability to have the Linux kernel track whether or
>> >> > not a particular route should be used based on the link-status of the
>> >> > interface associated with the next-hop.
>> >> >
>> >> > Before this patch any link-failure on an interface that was serving as a
>> >> > gateway for some systems could result in those systems being isolated
>> >> > from the rest of the network as the stack would continue to attempt to
>> >> > send frames out of an interface that is actually linked-down.  When the
>> >> > kernel is responsible for all forwarding, it should also be responsible
>> >> > for taking action when the traffic can no longer be forwarded -- there
>> >> > is no real need to outsource link-monitoring to userspace anymore.
>> >>
>> >> Hi Andy, how does this work for the hardware offload case?  I'm not
>> >> seeing how hardware gets updated when the software route is updated.
>> >> Seems hardware isn't updated and would send frames out the downed nh
>> >> interface.
>> >
>> > Scott, you are correct that this does not currently address the hardware
>> > offload case.  If one wanted offloaded hardware to reflect this change,
>> > then there needs to be either:
>> >
>> > - The ability to communicate the flag changes down to offload devices in
>> >   the event that they want some action to take place
>> >   switchdev_fib_ipv4_modify(), maybe? (Not to derail this convo too much
>> >   but it will soon be time to consider what to do at the switchdev layer
>> >   when route flags are modified like when the RTNH_F_OFFLOAD flag is
>> >   removed by the user if they feel the route does not need to be
>> >   offloaded for any reason.)
>>
>> Switchdev_fib_ipv4_add() can handle adds and mods, so you can use
>> that.  Any place you're sending notification of route change, hook
>> switchdev so the offload device is updated with the route change.
> I should know better than to ask this, but does just the rocker
> implementation presume this operation is an add and modify or was this
> the implementation from the beginning.  (Forgive me for not remembering
> earlier discussion on this.)

The intent was add/mod, but looks like I need to update some
documentation; I'll do that.  See fib_trie.c:fib_table_insert case for
NLM_F_REPLACE...that's a mod case.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Scott Feldman June 3, 2015, 6:15 p.m. UTC | #19
On Tue, Jun 2, 2015 at 8:07 PM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:
> This feature is only enabled with the new sysctl set (default is off):
> net.core.kill_routes_on_linkdown = 1

One more thing, sorry.  This feature is typically implemented today in
user-space on a per-interface basis.  The example I'm thinking of is
Quagga's "link-detect" directive which goes on an interface.  Should
this be a bool on each interface in systcl?  That would let user not
enable on selected interfaces.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Gospodarek June 3, 2015, 6:16 p.m. UTC | #20
On Wed, Jun 03, 2015 at 11:12:11AM -0700, Scott Feldman wrote:
> On Wed, Jun 3, 2015 at 10:57 AM, Andy Gospodarek
> <gospo@cumulusnetworks.com> wrote:
> > On Wed, Jun 03, 2015 at 10:40:26AM -0700, Scott Feldman wrote:
> >> On Wed, Jun 3, 2015 at 8:12 AM, Andy Gospodarek
> >> <gospo@cumulusnetworks.com> wrote:
> >> > On Tue, Jun 02, 2015 at 10:03:23PM -0700, Scott Feldman wrote:
> >> >> On Tue, Jun 2, 2015 at 8:07 PM, Andy Gospodarek
> >> >> <gospo@cumulusnetworks.com> wrote:
> >> >> > This patch adds the ability to have the Linux kernel track whether or
> >> >> > not a particular route should be used based on the link-status of the
> >> >> > interface associated with the next-hop.
> >> >> >
> >> >> > Before this patch any link-failure on an interface that was serving as a
> >> >> > gateway for some systems could result in those systems being isolated
> >> >> > from the rest of the network as the stack would continue to attempt to
> >> >> > send frames out of an interface that is actually linked-down.  When the
> >> >> > kernel is responsible for all forwarding, it should also be responsible
> >> >> > for taking action when the traffic can no longer be forwarded -- there
> >> >> > is no real need to outsource link-monitoring to userspace anymore.
> >> >>
> >> >> Hi Andy, how does this work for the hardware offload case?  I'm not
> >> >> seeing how hardware gets updated when the software route is updated.
> >> >> Seems hardware isn't updated and would send frames out the downed nh
> >> >> interface.
> >> >
> >> > Scott, you are correct that this does not currently address the hardware
> >> > offload case.  If one wanted offloaded hardware to reflect this change,
> >> > then there needs to be either:
> >> >
> >> > - The ability to communicate the flag changes down to offload devices in
> >> >   the event that they want some action to take place
> >> >   switchdev_fib_ipv4_modify(), maybe? (Not to derail this convo too much
> >> >   but it will soon be time to consider what to do at the switchdev layer
> >> >   when route flags are modified like when the RTNH_F_OFFLOAD flag is
> >> >   removed by the user if they feel the route does not need to be
> >> >   offloaded for any reason.)
> >>
> >> Switchdev_fib_ipv4_add() can handle adds and mods, so you can use
> >> that.  Any place you're sending notification of route change, hook
> >> switchdev so the offload device is updated with the route change.
> > I should know better than to ask this, but does just the rocker
> > implementation presume this operation is an add and modify or was this
> > the implementation from the beginning.  (Forgive me for not remembering
> > earlier discussion on this.)
> 
> The intent was add/mod, but looks like I need to update some
> documentation; I'll do that.  See fib_trie.c:fib_table_insert case for
> NLM_F_REPLACE...that's a mod case.
The docs/comments in the code were my concern.  Thanks for doing that.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander H Duyck June 3, 2015, 6:25 p.m. UTC | #21
On 06/02/2015 08:07 PM, Andy Gospodarek wrote:
> This patch adds the ability to have the Linux kernel track whether or
> not a particular route should be used based on the link-status of the
> interface associated with the next-hop.
>
> Before this patch any link-failure on an interface that was serving as a
> gateway for some systems could result in those systems being isolated
> from the rest of the network as the stack would continue to attempt to
> send frames out of an interface that is actually linked-down.  When the
> kernel is responsible for all forwarding, it should also be responsible
> for taking action when the traffic can no longer be forwarded -- there
> is no real need to outsource link-monitoring to userspace anymore.
>
> This feature is only enabled with the new sysctl set (default is off):
> net.core.kill_routes_on_linkdown = 1
>
> When this is set, the following behavior can be observed (interface p8p1
> is link-down):
>
> # ip route show
> default via 10.0.5.2 dev p9p1
> 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
> 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
> 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead
> 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead
> 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
> # ip route get 90.0.0.1
> 90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
>      cache
> # ip route get 80.0.0.1
> local 80.0.0.1 dev lo  src 80.0.0.1
>      cache <local>
> # ip route get 80.0.0.2
> 80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
>      cache
>
> While the route does remain in the table (so it can be modified if
> needed rather than being wiped away as it would be if IFF_UP was
> cleared), the proper next-hop is chosen automatically when the link is
> down.  Now interface p8p1 is linked-up:
>
> # ip route show
> default via 10.0.5.2 dev p9p1
> 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
> 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
> 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
> 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
> 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
> 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
> # ip route get 90.0.0.1
> 90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
>      cache
> # ip route get 80.0.0.1
> local 80.0.0.1 dev lo  src 80.0.0.1
>      cache <local>
> # ip route get 80.0.0.2
> 80.0.0.2 dev p8p1  src 80.0.0.1
>      cache
>
> and the output changes to what one would expect.
>
> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
>
> ---
> Though there were some that preferred not to have a configuration option
> and to make this behavior the default when it was discussed in Ottawa
> earlier this year since "it was time to do this."  I wanted to propose
> the config option to preserve the current behavior for those that desire
> it.  I'll happily remove it if Dave and Linus approve.
>
> An IPv6 implementation is also needed (DECnet too!), but I wanted to
> start with the IPv4 implementation to get people comfortable with the
> idea before moving forward.  If this is accepted the IPv6 implementation
> can be posted shortly.
>
> FWIW, we have been running this patch with the sysctl setting above and
> our customers have been happily using a backported version for IPv4 a
> IPv6 for >6 months.

Really the patch below doesn't completely jive with what you have stated 
in the patch description above.  I would have really much rather seen 
the DEAD_LINKDOWN be the only behavior you changed.  Instead there are a 
number of changes to the DEAD flag that I am not sure are really necessary.

I would have rather seen DEAD_LINKDOWN treated more like something where 
the metric is set to an infinite value than something where it starts to 
get in the way of things like flushing the table.

>   include/linux/netdevice.h      |  1 +
>   include/net/fib_rules.h        |  1 +
>   include/net/ip_fib.h           |  1 +
>   include/uapi/linux/rtnetlink.h |  1 +
>   include/uapi/linux/sysctl.h    |  1 +
>   kernel/sysctl_binary.c         |  1 +
>   net/core/dev.c                 |  2 ++
>   net/core/sysctl_net_core.c     |  7 +++++++
>   net/ipv4/fib_frontend.c        | 12 +++++++++--
>   net/ipv4/fib_rules.c           |  7 ++++++-
>   net/ipv4/fib_semantics.c       | 46 ++++++++++++++++++++++++++++++++++++------
>   net/ipv4/fib_trie.c            | 19 +++++++++++++----
>   12 files changed, 86 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 6f5f71f..5bd953c 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2986,6 +2986,7 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
>   bool is_skb_forwardable(struct net_device *dev, struct sk_buff *skb);
>
>   extern int		netdev_budget;
> +extern int		kill_routes_on_linkdown;
>
>   /* Called by rtnetlink.c:rtnl_unlock() */
>   void netdev_run_todo(void);
> diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
> index 6d67383..4fbfda5 100644
> --- a/include/net/fib_rules.h
> +++ b/include/net/fib_rules.h
> @@ -37,6 +37,7 @@ struct fib_lookup_arg {
>   	struct fib_rule		*rule;
>   	int			flags;
>   #define FIB_LOOKUP_NOREF	1
> +#define FIB_LOOKUP_ALLOWDEAD	2
>   };
>
>   struct fib_rules_ops {
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index 54271ed..efb195b 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -250,6 +250,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id);
>   struct fib_table *fib_get_table(struct net *net, u32 id);
>
>   int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res);
> +int fib_lookup_flags(struct net *n, struct flowi4 *flp, struct fib_result *res, int flags);
>
>   static inline int fib_lookup(struct net *net, struct flowi4 *flp,
>   			     struct fib_result *res)
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 17fb02f..bc264d4 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -338,6 +338,7 @@ struct rtnexthop {
>   #define RTNH_F_PERVASIVE	2	/* Do recursive gateway lookup	*/
>   #define RTNH_F_ONLINK		4	/* Gateway is forced on link	*/
>   #define RTNH_F_OFFLOAD		8	/* offloaded route */
> +#define RTNH_F_DEAD_LINKDOWN	16	/* Route dead due to carrier down */
>
>   /* Macros to handle hexthops */
>
> diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
> index 0956373..b6632dc 100644
> --- a/include/uapi/linux/sysctl.h
> +++ b/include/uapi/linux/sysctl.h
> @@ -276,6 +276,7 @@ enum
>   	NET_CORE_AEVENT_ETIME=20,
>   	NET_CORE_AEVENT_RSEQTH=21,
>   	NET_CORE_WARNINGS=22,
> +	NET_CORE_KILL_ROUTES_ON_LINKDOWN=23,
>   };
>
>   /* /proc/sys/net/ethernet */
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 7e7746a..f319116 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -196,6 +196,7 @@ static const struct bin_table bin_net_core_table[] = {
>   	{ CTL_INT,	NET_CORE_AEVENT_ETIME,	"xfrm_aevent_etime" },
>   	{ CTL_INT,	NET_CORE_AEVENT_RSEQTH,	"xfrm_aevent_rseqth" },
>   	{ CTL_INT,	NET_CORE_WARNINGS,	"warnings" },
> +	{ CTL_INT,	NET_CORE_KILL_ROUTES_ON_LINKDOWN,	"kill_routes_on_linkdown" },
>   	{},
>   };
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0602e91..50dc19c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3167,6 +3167,8 @@ EXPORT_SYMBOL(netdev_max_backlog);
>   int netdev_tstamp_prequeue __read_mostly = 1;
>   int netdev_budget __read_mostly = 300;
>   int weight_p __read_mostly = 64;            /* old backlog weight */
> +int kill_routes_on_linkdown = 0;
> +EXPORT_SYMBOL(kill_routes_on_linkdown);
>
>   /* Called with irq disabled */
>   static inline void ____napi_schedule(struct softnet_data *sd,

You might want to move this up or down a few lines.  It doesn't really 
make much sense to drop routing specific stuff in the middle of the NAPI 
receive definitions.

It would be nice if you could find some place a bit more routing related 
to place them.

> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index 95b6139..fef1804 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -392,6 +392,13 @@ static struct ctl_table net_core_table[] = {
>   		.mode		= 0644,
>   		.proc_handler	= proc_dointvec
>   	},
> +	{
> +		.procname	= "kill_routes_on_linkdown",
> +		.data		= &kill_routes_on_linkdown,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec
> +	},
>   	{ }
>   };
>
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 872494e..94348c7 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -1107,6 +1107,7 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
>   	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>   	struct in_device *in_dev;
>   	struct net *net = dev_net(dev);
> +	unsigned flags;
>
>   	if (event == NETDEV_UNREGISTER) {
>   		fib_disable_ip(dev, 2);
> @@ -1130,10 +1131,17 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
>   		rt_cache_flush(net);
>   		break;
>   	case NETDEV_DOWN:
> -		fib_disable_ip(dev, 0);
> +		fib_disable_ip(dev, 1);
>   		break;
> -	case NETDEV_CHANGEMTU:
>   	case NETDEV_CHANGE:
> +		if (kill_routes_on_linkdown) {
> +			flags = dev_get_flags(dev);
> +			if (flags & (IFF_RUNNING|IFF_LOWER_UP))
> +				fib_sync_up(dev);
> +			else
> +				fib_sync_down_dev(dev, 0);
> +		}
> +	case NETDEV_CHANGEMTU:
>   		rt_cache_flush(net);
>   		break;
>   	}

I thought the value of 1 is already being used in fib_inetaddr_event.

> diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
> index 5615198..d135ec9 100644
> --- a/net/ipv4/fib_rules.c
> +++ b/net/ipv4/fib_rules.c
> @@ -49,9 +49,14 @@ struct fib4_rule {
>
>   int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res)
>   {
> +	return fib_lookup_flags(net, flp, res, 0);
> +}
> +
> +int fib_lookup_flags(struct net *net, struct flowi4 *flp, struct fib_result *res, int flags)
> +{
>   	struct fib_lookup_arg arg = {
>   		.result = res,
> -		.flags = FIB_LOOKUP_NOREF,
> +		.flags = FIB_LOOKUP_NOREF | flags,
>   	};
>   	int err;
>

You would probably be better off forking this function so that you have 
a version that can lookup dead routes.  Since this is hot-path we don't 
wan to have to split this up over too many functions.

> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index 28ec3c1..c0874ee 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -266,7 +266,7 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
>   #ifdef CONFIG_IP_ROUTE_CLASSID
>   		    nh->nh_tclassid != onh->nh_tclassid ||
>   #endif
> -		    ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD))
> +		    ((nh->nh_flags ^ onh->nh_flags) & ~(RTNH_F_DEAD|RTNH_F_DEAD_LINKDOWN)))
>   			return -1;
>   		onh++;
>   	} endfor_nexthops(fi);

You might just want to come up with a define to cover both dead cases. 
Maybe something like "RTNH_F_DEAD_ANY"

> @@ -318,7 +318,7 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi)
>   		    nfi->fib_type == fi->fib_type &&
>   		    memcmp(nfi->fib_metrics, fi->fib_metrics,
>   			   sizeof(u32) * RTAX_MAX) == 0 &&
> -		    ((nfi->fib_flags ^ fi->fib_flags) & ~RTNH_F_DEAD) == 0 &&
> +		    ((nfi->fib_flags ^ fi->fib_flags) & ~(RTNH_F_DEAD|RTNH_F_DEAD_LINKDOWN)) == 0 &&
>   		    (nfi->fib_nhs == 0 || nh_comp(fi, nfi) == 0))
>   			return fi;
>   	}
> @@ -604,6 +604,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>   				return -ENODEV;
>   			if (!(dev->flags & IFF_UP))
>   				return -ENETDOWN;
> +			if (!netif_carrier_ok(dev) && kill_routes_on_linkdown)
> +				nh->nh_flags |= RTNH_F_DEAD;
>   			nh->nh_dev = dev;
>   			dev_hold(dev);
>   			nh->nh_scope = RT_SCOPE_LINK;

It seems like you really should be using the DEAD_LINKDOWN here since 
this is what you are trying to indicate.

Also it really seems like you should move kill_routes_on_linkdown to 
fib_table_lookup.  You should track the extra flag always, and only skip 
the routes if kill_routes_on_linkdown is set.  Then you can override 
that with the extra flag being passed to fib_table_lookup.

Actually now that I think about it you could probably use 
kill_routes_on_linkdown in __fib_lookup to set a flag indicating that 
you want to kill the routes, otherwise just use the default behavior.

> @@ -621,7 +623,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>   			/* It is not necessary, but requires a bit of thinking */
>   			if (fl4.flowi4_scope < RT_SCOPE_LINK)
>   				fl4.flowi4_scope = RT_SCOPE_LINK;
> -			err = fib_lookup(net, &fl4, &res);
> +			err = fib_lookup_flags(net, &fl4, &res, FIB_LOOKUP_ALLOWDEAD);
>   			if (err) {
>   				rcu_read_unlock();
>   				return err;
> @@ -636,6 +638,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>   		if (!dev)
>   			goto out;
>   		dev_hold(dev);
> +		if (!netif_carrier_ok(dev) && kill_routes_on_linkdown)
> +			nh->nh_flags |= RTNH_F_DEAD;
>   		err = (dev->flags & IFF_UP) ? 0 : -ENETDOWN;
>   	} else {
>   		struct in_device *in_dev;

Same here.  There is no point in overloading RTNH_F_DEAD if you already 
are adding a new bit that will represent something similar.

> @@ -654,6 +658,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>   		nh->nh_dev = in_dev->dev;
>   		dev_hold(nh->nh_dev);
>   		nh->nh_scope = RT_SCOPE_HOST;
> +		if (!netif_carrier_ok(nh->nh_dev) && kill_routes_on_linkdown)
> +			nh->nh_flags |= RTNH_F_DEAD;
>   		err = 0;
>   	}
>   out:
> @@ -760,6 +766,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>   	struct fib_info *ofi;
>   	int nhs = 1;
>   	struct net *net = cfg->fc_nlinfo.nl_net;
> +	int dead;
>
>   	if (cfg->fc_type > RTN_MAX)
>   		goto err_inval;
> @@ -920,11 +927,18 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>   		if (!nh->nh_dev)
>   			goto failure;
>   	} else {
> +		dead = 0;
>   		change_nexthops(fi) {
>   			err = fib_check_nh(cfg, fi, nexthop_nh);
>   			if (err != 0)
>   				goto failure;
> +			if (nexthop_nh->nh_flags & RTNH_F_DEAD)
> +				dead++;
>   		} endfor_nexthops(fi)
> +		if ((dead == fi->fib_nhs)) {
> +			fi->fib_flags |= RTNH_F_DEAD;
> +			fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
> +		}
>   	}
>
>   	if (fi->fib_prefsrc) {
> @@ -1097,6 +1111,8 @@ int fib_sync_down_addr(struct net *net, __be32 local)
>   			continue;
>   		if (fi->fib_prefsrc == local) {
>   			fi->fib_flags |= RTNH_F_DEAD;
> +			/* Addr is gone, more serious than a linkdown */
> +			fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
>   			ret++;
>   		}
>   	}
> @@ -1112,7 +1128,7 @@ int fib_sync_down_dev(struct net_device *dev, int force)
>   	struct hlist_head *head = &fib_info_devhash[hash];
>   	struct fib_nh *nh;
>
> -	if (force)
> +	if (force > 1)
>   		scope = -1;
>
>   	hlist_for_each_entry(nh, head, nh_hash) {
> @@ -1147,6 +1163,11 @@ int fib_sync_down_dev(struct net_device *dev, int force)
>   		} endfor_nexthops(fi)
>   		if (dead == fi->fib_nhs) {
>   			fi->fib_flags |= RTNH_F_DEAD;
> +			/* force marks route down due to admin down and device removal. */
> +			if (!force)
> +				fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
> +			else
> +				fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
>   			ret++;
>   		}
>   	}

So what is the idea behind changing the force value like this?  It seems 
like it would make this much more readable if you were to simply use a 
value such as -1 to handle your DEAD_LINKDOWN case rather than altering 
the behavior for the values 1 and 0.

> @@ -1223,10 +1244,12 @@ int fib_sync_up(struct net_device *dev)
>   	struct hlist_head *head;
>   	struct fib_nh *nh;
>   	int ret;
> +	int link_up;
>
>   	if (!(dev->flags & IFF_UP))
>   		return 0;
>
> +	link_up = netif_carrier_ok(dev);
>   	prev_fi = NULL;
>   	hash = fib_devindex_hashfn(dev->ifindex);
>   	head = &fib_info_devhash[hash];
> @@ -1253,16 +1276,27 @@ int fib_sync_up(struct net_device *dev)
>   			if (nexthop_nh->nh_dev != dev ||
>   			    !__in_dev_get_rtnl(dev))
>   				continue;
> -			alive++;
> +			if (link_up) {
> +				/* Link is up, so mark NH as alive */
> +				nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
> +				alive++;
> +			} else
> +				nexthop_nh->nh_flags |= RTNH_F_DEAD;
> +
>   			spin_lock_bh(&fib_multipath_lock);
>   			nexthop_nh->nh_power = 0;
> -			nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
>   			spin_unlock_bh(&fib_multipath_lock);
>   		} endfor_nexthops(fi)
>
>   		if (alive > 0) {
> +			/* Some NHs are alive, unmark the route as dead */
> +			fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
>   			fi->fib_flags &= ~RTNH_F_DEAD;
>   			ret++;
> +		} else {
> +			/* No NHs are alive, mark the route as dead */
> +			fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
> +			fi->fib_flags |= RTNH_F_DEAD;
>   		}
>   	}
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 01bce15..eedf287 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -1401,12 +1401,18 @@ found:
>   #endif
>   			return err;
>   		}
> -		if (fi->fib_flags & RTNH_F_DEAD)
> -			continue;
> +		if (!(fib_flags & FIB_LOOKUP_ALLOWDEAD)) {
> +			/* if route is dead and link is down, keep looking  */
> +			if ((fi->fib_flags & RTNH_F_DEAD) &&
> +			    (fi->fib_flags & RTNH_F_DEAD_LINKDOWN))
> +				continue;
> +		}

The overloading of things makes this bit confusing.  I would much rather 
have seen this as:
		if (fi->fib_flags & RTNH_F_DEAD)
			continue;
		if ((fi->fib_flags & RTNH_F_DEAD_LINKDOWN) &&
		    !(fib_flags & FIB_LOOKUP_ALLOWDEAD))
			continue;

>   		for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
>   			const struct fib_nh *nh = &fi->fib_nh[nhsel];
>
> -			if (nh->nh_flags & RTNH_F_DEAD)
> +			/* allow next-hop to be added if link is down */
> +			if ((nh->nh_flags & RTNH_F_DEAD) &&
> +			    !(fi->fib_flags & RTNH_F_DEAD_LINKDOWN))
>   				continue;
>   			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
>   				continue;

Same here.  By combining the DEAD and DEAD_LINKDOWN the way they have 
been it is hard to tell exaclty what is going on.  It would be much 
easier to sort all of this out if DEAD was left as DEAD, and 
DEAD_LINKDOWN was handled as something similar with option to override.

> @@ -1829,7 +1835,12 @@ int fib_table_flush(struct fib_table *tb)
>   		hlist_for_each_entry_safe(fa, tmp, &n->leaf, fa_list) {
>   			struct fib_info *fi = fa->fa_info;
>
> -			if (!fi || !(fi->fib_flags & RTNH_F_DEAD)) {
> +			/* DEAD and DEAD_LINKDOWN will not both be set
> +			 * with IFF_UP is cleared, so do not flush
> +			 * entries with only DEAD set
> +			 */
> +			if (!fi || !((fi->fib_flags & RTNH_F_DEAD) &&
> +			    !(fi->fib_flags & RTNH_F_DEAD_LINKDOWN))) {
>   				slen = fa->fa_slen;
>   				continue;
>   			}
>

First you could do this flags check with just one and:
	if (!fi || !(fi->fib_flags & (RTNH_F_DEAD |
				      RTNH_F_DEAD_LINKDOWN)) {

Second I am not a fan of this flag messing with stuff such as the suffix 
length since we are now cut out of the search for just a link down event 
and the fact is link down/link up events can occur quickly and in 
significant quantaties int he case of a link bouncing.

Really what I would prefer to see is that this route gets ignored in 
fib_table_lookup in the case of a link down, and then we can avoid all 
of the ugly messing with RTNH_F_DEAD that seems to be happening as it 
makes it much more difficult to decode the actual state with the two flags.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Gospodarek June 3, 2015, 6:27 p.m. UTC | #22
On Wed, Jun 03, 2015 at 11:15:55AM -0700, Scott Feldman wrote:
> On Tue, Jun 2, 2015 at 8:07 PM, Andy Gospodarek
> <gospo@cumulusnetworks.com> wrote:
> > This feature is only enabled with the new sysctl set (default is off):
> > net.core.kill_routes_on_linkdown = 1
> 
> One more thing, sorry.  This feature is typically implemented today in
> user-space on a per-interface basis.  The example I'm thinking of is
> Quagga's "link-detect" directive which goes on an interface.  Should
> this be a bool on each interface in systcl?  That would let user not
> enable on selected interfaces.
That would not be my preference.

I'm willing to investigate the per-namespace support if Hannes would
like and add switchdev support for v2, but would prefer this not become
that granular.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Scott Feldman June 3, 2015, 7:03 p.m. UTC | #23
On Wed, Jun 3, 2015 at 11:27 AM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:
> On Wed, Jun 03, 2015 at 11:15:55AM -0700, Scott Feldman wrote:
>> On Tue, Jun 2, 2015 at 8:07 PM, Andy Gospodarek
>> <gospo@cumulusnetworks.com> wrote:
>> > This feature is only enabled with the new sysctl set (default is off):
>> > net.core.kill_routes_on_linkdown = 1
>>
>> One more thing, sorry.  This feature is typically implemented today in
>> user-space on a per-interface basis.  The example I'm thinking of is
>> Quagga's "link-detect" directive which goes on an interface.  Should
>> this be a bool on each interface in systcl?  That would let user not
>> enable on selected interfaces.
> That would not be my preference.
>
> I'm willing to investigate the per-namespace support if Hannes would
> like and add switchdev support for v2, but would prefer this not become
> that granular.


Is there a reason?  Seems you moving a feature from user space to the
kernel, so I'm curious why you wouldn't want to mimic functionality
that's available today.

I like the patch/feature, BTW.  Thanks for doing this.

-scott
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Gospodarek June 3, 2015, 8:02 p.m. UTC | #24
On Wed, Jun 03, 2015 at 11:25:09AM -0700, Alexander Duyck wrote:
> On 06/02/2015 08:07 PM, Andy Gospodarek wrote:
[...]
> >Though there were some that preferred not to have a configuration option
> >and to make this behavior the default when it was discussed in Ottawa
> >earlier this year since "it was time to do this."  I wanted to propose
> >the config option to preserve the current behavior for those that desire
> >it.  I'll happily remove it if Dave and Linus approve.
> >
> >An IPv6 implementation is also needed (DECnet too!), but I wanted to
> >start with the IPv4 implementation to get people comfortable with the
> >idea before moving forward.  If this is accepted the IPv6 implementation
> >can be posted shortly.
> >
> >FWIW, we have been running this patch with the sysctl setting above and
> >our customers have been happily using a backported version for IPv4 a
> >IPv6 for >6 months.
> 
> Really the patch below doesn't completely jive with what you have stated in
> the patch description above.  I would have really much rather seen the
> DEAD_LINKDOWN be the only behavior you changed.  Instead there are a number
> of changes to the DEAD flag that I am not sure are really necessary.
The main reason for the overload (which seems to be the source of most
of your comments), was to allow routes to be reported back as dead to
userspace with no modification to iproute2 and friends since there is
already the ability to print that a route is dead and that implication
is pretty clear.

The other thing I wanted to avoid is to have users update their kernel
(and the sysctl option) while not updating userspace and then have no
real clue why routes are not being selected since their iproute2 output
would not look any different.  It seems preservation of userspace seems
paramount, even in the face of a different config of the kernel.  

> I would have rather seen DEAD_LINKDOWN treated more like something where the
> metric is set to an infinite value than something where it starts to get in
> the way of things like flushing the table.
[...]
> >
> >diff --git a/net/core/dev.c b/net/core/dev.c
> >index 0602e91..50dc19c 100644
> >--- a/net/core/dev.c
> >+++ b/net/core/dev.c
> >@@ -3167,6 +3167,8 @@ EXPORT_SYMBOL(netdev_max_backlog);
> >  int netdev_tstamp_prequeue __read_mostly = 1;
> >  int netdev_budget __read_mostly = 300;
> >  int weight_p __read_mostly = 64;            /* old backlog weight */
> >+int kill_routes_on_linkdown = 0;
> >+EXPORT_SYMBOL(kill_routes_on_linkdown);
> >
> >  /* Called with irq disabled */
> >  static inline void ____napi_schedule(struct softnet_data *sd,
> 
> You might want to move this up or down a few lines.  It doesn't really make
> much sense to drop routing specific stuff in the middle of the NAPI receive
> definitions.
> 
> It would be nice if you could find some place a bit more routing related to
> place them.
No problem.

> >diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> >index 95b6139..fef1804 100644
> >--- a/net/core/sysctl_net_core.c
> >+++ b/net/core/sysctl_net_core.c
> >@@ -392,6 +392,13 @@ static struct ctl_table net_core_table[] = {
> >  		.mode		= 0644,
> >  		.proc_handler	= proc_dointvec
> >  	},
> >+	{
> >+		.procname	= "kill_routes_on_linkdown",
> >+		.data		= &kill_routes_on_linkdown,
> >+		.maxlen		= sizeof(int),
> >+		.mode		= 0644,
> >+		.proc_handler	= proc_dointvec
> >+	},
> >  	{ }
> >  };
> >
> >diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> >index 872494e..94348c7 100644
> >--- a/net/ipv4/fib_frontend.c
> >+++ b/net/ipv4/fib_frontend.c
> >@@ -1107,6 +1107,7 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
> >  	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> >  	struct in_device *in_dev;
> >  	struct net *net = dev_net(dev);
> >+	unsigned flags;
> >
> >  	if (event == NETDEV_UNREGISTER) {
> >  		fib_disable_ip(dev, 2);
> >@@ -1130,10 +1131,17 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
> >  		rt_cache_flush(net);
> >  		break;
> >  	case NETDEV_DOWN:
> >-		fib_disable_ip(dev, 0);
> >+		fib_disable_ip(dev, 1);
> >  		break;
> >-	case NETDEV_CHANGEMTU:
> >  	case NETDEV_CHANGE:
> >+		if (kill_routes_on_linkdown) {
> >+			flags = dev_get_flags(dev);
> >+			if (flags & (IFF_RUNNING|IFF_LOWER_UP))
> >+				fib_sync_up(dev);
> >+			else
> >+				fib_sync_down_dev(dev, 0);
> >+		}
> >+	case NETDEV_CHANGEMTU:
> >  		rt_cache_flush(net);
> >  		break;
> >  	}
> 
> I thought the value of 1 is already being used in fib_inetaddr_event.
You are correct.  I'll need to check the behavior to make sure this does
not break that (or put a check for kill_routes_on_linkdown) here
instead.

> >diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
> >index 5615198..d135ec9 100644
> >--- a/net/ipv4/fib_rules.c
> >+++ b/net/ipv4/fib_rules.c
> >@@ -49,9 +49,14 @@ struct fib4_rule {
> >
> >  int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res)
> >  {
> >+	return fib_lookup_flags(net, flp, res, 0);
> >+}
> >+
> >+int fib_lookup_flags(struct net *net, struct flowi4 *flp, struct fib_result *res, int flags)
> >+{
> >  	struct fib_lookup_arg arg = {
> >  		.result = res,
> >-		.flags = FIB_LOOKUP_NOREF,
> >+		.flags = FIB_LOOKUP_NOREF | flags,
> >  	};
> >  	int err;
> >
> 
> You would probably be better off forking this function so that you have a
> version that can lookup dead routes.  Since this is hot-path we don't wan to
> have to split this up over too many functions.
> 
> >diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> >index 28ec3c1..c0874ee 100644
> >--- a/net/ipv4/fib_semantics.c
> >+++ b/net/ipv4/fib_semantics.c
> >@@ -266,7 +266,7 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
> >  #ifdef CONFIG_IP_ROUTE_CLASSID
> >  		    nh->nh_tclassid != onh->nh_tclassid ||
> >  #endif
> >-		    ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD))
> >+		    ((nh->nh_flags ^ onh->nh_flags) & ~(RTNH_F_DEAD|RTNH_F_DEAD_LINKDOWN)))
> >  			return -1;
> >  		onh++;
> >  	} endfor_nexthops(fi);
> 
> You might just want to come up with a define to cover both dead cases. Maybe
> something like "RTNH_F_DEAD_ANY"
> 
> >@@ -318,7 +318,7 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi)
> >  		    nfi->fib_type == fi->fib_type &&
> >  		    memcmp(nfi->fib_metrics, fi->fib_metrics,
> >  			   sizeof(u32) * RTAX_MAX) == 0 &&
> >-		    ((nfi->fib_flags ^ fi->fib_flags) & ~RTNH_F_DEAD) == 0 &&
> >+		    ((nfi->fib_flags ^ fi->fib_flags) & ~(RTNH_F_DEAD|RTNH_F_DEAD_LINKDOWN)) == 0 &&
> >  		    (nfi->fib_nhs == 0 || nh_comp(fi, nfi) == 0))
> >  			return fi;
> >  	}
> >@@ -604,6 +604,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
> >  				return -ENODEV;
> >  			if (!(dev->flags & IFF_UP))
> >  				return -ENETDOWN;
> >+			if (!netif_carrier_ok(dev) && kill_routes_on_linkdown)
> >+				nh->nh_flags |= RTNH_F_DEAD;
> >  			nh->nh_dev = dev;
> >  			dev_hold(dev);
> >  			nh->nh_scope = RT_SCOPE_LINK;
> 
> It seems like you really should be using the DEAD_LINKDOWN here since this
> is what you are trying to indicate.
> 
> Also it really seems like you should move kill_routes_on_linkdown to
> fib_table_lookup.  You should track the extra flag always, and only skip the
> routes if kill_routes_on_linkdown is set.  Then you can override that with
> the extra flag being passed to fib_table_lookup.
> 
> Actually now that I think about it you could probably use
> kill_routes_on_linkdown in __fib_lookup to set a flag indicating that you
> want to kill the routes, otherwise just use the default behavior.
This is a significantly different design, but I like this.  One of the
issues I had with this was the fact that it felt like there were lots of
checks for kill_routes_on_linkdown and this would eliminate the need to
have this check in many places.

I can play with the following if it seems compelling (and it seems that
this feature is desireable so I'm willing to give it a try):

- set RTNH_F_LINKDOWN (dropping the 'DEAD' from the string) on all
  routes with interfaces that have link-down

- move check for kill_routes_on_linkdown to fib_lookup path and see if
  I can differentiate the cases between when I'm trying to add a route
  on an interface and doing a lookup

- set RTNH_F_DEAD on all nexthops during a fib_dump_info() so older
  userspace tools will still know about the dead routes

> 
> >@@ -621,7 +623,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
> >  			/* It is not necessary, but requires a bit of thinking */
> >  			if (fl4.flowi4_scope < RT_SCOPE_LINK)
> >  				fl4.flowi4_scope = RT_SCOPE_LINK;
> >-			err = fib_lookup(net, &fl4, &res);
> >+			err = fib_lookup_flags(net, &fl4, &res, FIB_LOOKUP_ALLOWDEAD);
> >  			if (err) {
> >  				rcu_read_unlock();
> >  				return err;
> >@@ -636,6 +638,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
> >  		if (!dev)
> >  			goto out;
> >  		dev_hold(dev);
> >+		if (!netif_carrier_ok(dev) && kill_routes_on_linkdown)
> >+			nh->nh_flags |= RTNH_F_DEAD;
> >  		err = (dev->flags & IFF_UP) ? 0 : -ENETDOWN;
> >  	} else {
> >  		struct in_device *in_dev;
> 
> Same here.  There is no point in overloading RTNH_F_DEAD if you already are
> adding a new bit that will represent something similar.
The design decision for this mentioned above should probably make the
decision at least clear -- maybe not correct, but it should make more
sense.  :-)

> >@@ -654,6 +658,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
> >  		nh->nh_dev = in_dev->dev;
> >  		dev_hold(nh->nh_dev);
> >  		nh->nh_scope = RT_SCOPE_HOST;
> >+		if (!netif_carrier_ok(nh->nh_dev) && kill_routes_on_linkdown)
> >+			nh->nh_flags |= RTNH_F_DEAD;
> >  		err = 0;
> >  	}
> >  out:
> >@@ -760,6 +766,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
> >  	struct fib_info *ofi;
> >  	int nhs = 1;
> >  	struct net *net = cfg->fc_nlinfo.nl_net;
> >+	int dead;
> >
> >  	if (cfg->fc_type > RTN_MAX)
> >  		goto err_inval;
> >@@ -920,11 +927,18 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
> >  		if (!nh->nh_dev)
> >  			goto failure;
> >  	} else {
> >+		dead = 0;
> >  		change_nexthops(fi) {
> >  			err = fib_check_nh(cfg, fi, nexthop_nh);
> >  			if (err != 0)
> >  				goto failure;
> >+			if (nexthop_nh->nh_flags & RTNH_F_DEAD)
> >+				dead++;
> >  		} endfor_nexthops(fi)
> >+		if ((dead == fi->fib_nhs)) {
> >+			fi->fib_flags |= RTNH_F_DEAD;
> >+			fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
> >+		}
> >  	}
> >
> >  	if (fi->fib_prefsrc) {
> >@@ -1097,6 +1111,8 @@ int fib_sync_down_addr(struct net *net, __be32 local)
> >  			continue;
> >  		if (fi->fib_prefsrc == local) {
> >  			fi->fib_flags |= RTNH_F_DEAD;
> >+			/* Addr is gone, more serious than a linkdown */
> >+			fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
> >  			ret++;
> >  		}
> >  	}
> >@@ -1112,7 +1128,7 @@ int fib_sync_down_dev(struct net_device *dev, int force)
> >  	struct hlist_head *head = &fib_info_devhash[hash];
> >  	struct fib_nh *nh;
> >
> >-	if (force)
> >+	if (force > 1)
> >  		scope = -1;
> >
> >  	hlist_for_each_entry(nh, head, nh_hash) {
> >@@ -1147,6 +1163,11 @@ int fib_sync_down_dev(struct net_device *dev, int force)
> >  		} endfor_nexthops(fi)
> >  		if (dead == fi->fib_nhs) {
> >  			fi->fib_flags |= RTNH_F_DEAD;
> >+			/* force marks route down due to admin down and device removal. */
> >+			if (!force)
> >+				fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
> >+			else
> >+				fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
> >  			ret++;
> >  		}
> >  	}
> 
> So what is the idea behind changing the force value like this?  It seems
> like it would make this much more readable if you were to simply use a value
> such as -1 to handle your DEAD_LINKDOWN case rather than altering the
> behavior for the values 1 and 0.
Also a reasonable option.  The key is that I want to be able to
differentiate from a DEAD link that is due to a linkdown and DEAD due to
clearing of IFF_UP.

> 
> >@@ -1223,10 +1244,12 @@ int fib_sync_up(struct net_device *dev)
> >  	struct hlist_head *head;
> >  	struct fib_nh *nh;
> >  	int ret;
> >+	int link_up;
> >
> >  	if (!(dev->flags & IFF_UP))
> >  		return 0;
> >
> >+	link_up = netif_carrier_ok(dev);
> >  	prev_fi = NULL;
> >  	hash = fib_devindex_hashfn(dev->ifindex);
> >  	head = &fib_info_devhash[hash];
> >@@ -1253,16 +1276,27 @@ int fib_sync_up(struct net_device *dev)
> >  			if (nexthop_nh->nh_dev != dev ||
> >  			    !__in_dev_get_rtnl(dev))
> >  				continue;
> >-			alive++;
> >+			if (link_up) {
> >+				/* Link is up, so mark NH as alive */
> >+				nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
> >+				alive++;
> >+			} else
> >+				nexthop_nh->nh_flags |= RTNH_F_DEAD;
> >+
> >  			spin_lock_bh(&fib_multipath_lock);
> >  			nexthop_nh->nh_power = 0;
> >-			nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
> >  			spin_unlock_bh(&fib_multipath_lock);
> >  		} endfor_nexthops(fi)
> >
> >  		if (alive > 0) {
> >+			/* Some NHs are alive, unmark the route as dead */
> >+			fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
> >  			fi->fib_flags &= ~RTNH_F_DEAD;
> >  			ret++;
> >+		} else {
> >+			/* No NHs are alive, mark the route as dead */
> >+			fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
> >+			fi->fib_flags |= RTNH_F_DEAD;
> >  		}
> >  	}
> >
> >diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> >index 01bce15..eedf287 100644
> >--- a/net/ipv4/fib_trie.c
> >+++ b/net/ipv4/fib_trie.c
> >@@ -1401,12 +1401,18 @@ found:
> >  #endif
> >  			return err;
> >  		}
> >-		if (fi->fib_flags & RTNH_F_DEAD)
> >-			continue;
> >+		if (!(fib_flags & FIB_LOOKUP_ALLOWDEAD)) {
> >+			/* if route is dead and link is down, keep looking  */
> >+			if ((fi->fib_flags & RTNH_F_DEAD) &&
> >+			    (fi->fib_flags & RTNH_F_DEAD_LINKDOWN))
> >+				continue;
> >+		}
> 
> The overloading of things makes this bit confusing.  I would much rather
> have seen this as:
> 		if (fi->fib_flags & RTNH_F_DEAD)
> 			continue;
> 		if ((fi->fib_flags & RTNH_F_DEAD_LINKDOWN) &&
> 		    !(fib_flags & FIB_LOOKUP_ALLOWDEAD))
> 			continue;
> 
> >  		for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
> >  			const struct fib_nh *nh = &fi->fib_nh[nhsel];
> >
> >-			if (nh->nh_flags & RTNH_F_DEAD)
> >+			/* allow next-hop to be added if link is down */
> >+			if ((nh->nh_flags & RTNH_F_DEAD) &&
> >+			    !(fi->fib_flags & RTNH_F_DEAD_LINKDOWN))
> >  				continue;
> >  			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
> >  				continue;
> 
> Same here.  By combining the DEAD and DEAD_LINKDOWN the way they have been
> it is hard to tell exaclty what is going on.  It would be much easier to
> sort all of this out if DEAD was left as DEAD, and DEAD_LINKDOWN was handled
> as something similar with option to override.
I think this will be fine.  Clearly I would want to test it again first,
but I'd have no major issue with a cleanup like this.

> 
> >@@ -1829,7 +1835,12 @@ int fib_table_flush(struct fib_table *tb)
> >  		hlist_for_each_entry_safe(fa, tmp, &n->leaf, fa_list) {
> >  			struct fib_info *fi = fa->fa_info;
> >
> >-			if (!fi || !(fi->fib_flags & RTNH_F_DEAD)) {
> >+			/* DEAD and DEAD_LINKDOWN will not both be set
> >+			 * with IFF_UP is cleared, so do not flush
> >+			 * entries with only DEAD set
> >+			 */
> >+			if (!fi || !((fi->fib_flags & RTNH_F_DEAD) &&
> >+			    !(fi->fib_flags & RTNH_F_DEAD_LINKDOWN))) {
> >  				slen = fa->fa_slen;
> >  				continue;
> >  			}
> >
> 
> First you could do this flags check with just one and:
> 	if (!fi || !(fi->fib_flags & (RTNH_F_DEAD |
> 				      RTNH_F_DEAD_LINKDOWN)) {
> 
> Second I am not a fan of this flag messing with stuff such as the suffix
> length since we are now cut out of the search for just a link down event and
> the fact is link down/link up events can occur quickly and in significant
> quantaties int he case of a link bouncing.
> 
> Really what I would prefer to see is that this route gets ignored in
> fib_table_lookup in the case of a link down, and then we can avoid all of
> the ugly messing with RTNH_F_DEAD that seems to be happening as it makes it
> much more difficult to decode the actual state with the two flags.
That would be fine, but then a new way needs to be devised to report
this back to userspace to anyone who performs a dump of the fib table
knows that routes are dead.  One such option would be to have a check
in fib_dump_info() for link status and add the RTNH_F_DEAD flag (or
another flag if it is deemed that this is necessary and not too
disruptive to userspace on the way up.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa June 3, 2015, 8:04 p.m. UTC | #25
On Wed, Jun 3, 2015, at 20:27, Andy Gospodarek wrote:
> On Wed, Jun 03, 2015 at 11:15:55AM -0700, Scott Feldman wrote:
> > On Tue, Jun 2, 2015 at 8:07 PM, Andy Gospodarek
> > <gospo@cumulusnetworks.com> wrote:
> > > This feature is only enabled with the new sysctl set (default is off):
> > > net.core.kill_routes_on_linkdown = 1
> > 
> > One more thing, sorry.  This feature is typically implemented today in
> > user-space on a per-interface basis.  The example I'm thinking of is
> > Quagga's "link-detect" directive which goes on an interface.  Should
> > this be a bool on each interface in systcl?  That would let user not
> > enable on selected interfaces.
> That would not be my preference.
> 
> I'm willing to investigate the per-namespace support if Hannes would
> like and add switchdev support for v2, but would prefer this not become
> that granular.

Actually, this idea also came to my mind: flagging specific routes if
they are eligible to suppress if the link is down.

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Gospodarek June 3, 2015, 8:11 p.m. UTC | #26
On Wed, Jun 03, 2015 at 12:03:53PM -0700, Scott Feldman wrote:
> On Wed, Jun 3, 2015 at 11:27 AM, Andy Gospodarek
> <gospo@cumulusnetworks.com> wrote:
> > On Wed, Jun 03, 2015 at 11:15:55AM -0700, Scott Feldman wrote:
> >> On Tue, Jun 2, 2015 at 8:07 PM, Andy Gospodarek
> >> <gospo@cumulusnetworks.com> wrote:
> >> > This feature is only enabled with the new sysctl set (default is off):
> >> > net.core.kill_routes_on_linkdown = 1
> >>
> >> One more thing, sorry.  This feature is typically implemented today in
> >> user-space on a per-interface basis.  The example I'm thinking of is
> >> Quagga's "link-detect" directive which goes on an interface.  Should
> >> this be a bool on each interface in systcl?  That would let user not
> >> enable on selected interfaces.
> > That would not be my preference.
> >
> > I'm willing to investigate the per-namespace support if Hannes would
> > like and add switchdev support for v2, but would prefer this not become
> > that granular.
> 
> 
> Is there a reason?  Seems you moving a feature from user space to the
> kernel, so I'm curious why you wouldn't want to mimic functionality
> that's available today.
To me it seemed like a feature that is more globally useful so I wanted
to just have it that way to start.  I'm not opposed to per-device
granularity in the future, but I did not want to differentiate right out
of the gate when there are enough other moving parts.

> I like the patch/feature, BTW.  Thanks for doing this.
Glad to hear it!  :)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Gospodarek June 3, 2015, 8:34 p.m. UTC | #27
On Wed, Jun 03, 2015 at 10:04:06PM +0200, Hannes Frederic Sowa wrote:
> On Wed, Jun 3, 2015, at 20:27, Andy Gospodarek wrote:
> > On Wed, Jun 03, 2015 at 11:15:55AM -0700, Scott Feldman wrote:
> > > On Tue, Jun 2, 2015 at 8:07 PM, Andy Gospodarek
> > > <gospo@cumulusnetworks.com> wrote:
> > > > This feature is only enabled with the new sysctl set (default is off):
> > > > net.core.kill_routes_on_linkdown = 1
> > > 
> > > One more thing, sorry.  This feature is typically implemented today in
> > > user-space on a per-interface basis.  The example I'm thinking of is
> > > Quagga's "link-detect" directive which goes on an interface.  Should
> > > this be a bool on each interface in systcl?  That would let user not
> > > enable on selected interfaces.
> > That would not be my preference.
> > 
> > I'm willing to investigate the per-namespace support if Hannes would
> > like and add switchdev support for v2, but would prefer this not become
> > that granular.
> 
> Actually, this idea also came to my mind: flagging specific routes if
> they are eligible to suppress if the link is down.
I'll have to think about how this would integrate with the alternative
design suggestion earlier in the thread from Alex.

Even if I do not implement this right away it might be nice to add the
ability to mark routes as permanent if the user desired.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa June 3, 2015, 8:36 p.m. UTC | #28
On Wed, Jun 3, 2015, at 22:34, Andy Gospodarek wrote:
> On Wed, Jun 03, 2015 at 10:04:06PM +0200, Hannes Frederic Sowa wrote:
> > On Wed, Jun 3, 2015, at 20:27, Andy Gospodarek wrote:
> > > On Wed, Jun 03, 2015 at 11:15:55AM -0700, Scott Feldman wrote:
> > > > On Tue, Jun 2, 2015 at 8:07 PM, Andy Gospodarek
> > > > <gospo@cumulusnetworks.com> wrote:
> > > > > This feature is only enabled with the new sysctl set (default is off):
> > > > > net.core.kill_routes_on_linkdown = 1
> > > > 
> > > > One more thing, sorry.  This feature is typically implemented today in
> > > > user-space on a per-interface basis.  The example I'm thinking of is
> > > > Quagga's "link-detect" directive which goes on an interface.  Should
> > > > this be a bool on each interface in systcl?  That would let user not
> > > > enable on selected interfaces.
> > > That would not be my preference.
> > > 
> > > I'm willing to investigate the per-namespace support if Hannes would
> > > like and add switchdev support for v2, but would prefer this not become
> > > that granular.
> > 
> > Actually, this idea also came to my mind: flagging specific routes if
> > they are eligible to suppress if the link is down.
> I'll have to think about how this would integrate with the alternative
> design suggestion earlier in the thread from Alex.
> 
> Even if I do not implement this right away it might be nice to add the
> ability to mark routes as permanent if the user desired.

If you do a per-route flag you would need a newer iproute binary in any
case, so you could apply Alex feedback more easily. :}

Bye,
Hannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander H Duyck June 3, 2015, 9:01 p.m. UTC | #29
On 06/03/2015 01:02 PM, Andy Gospodarek wrote:
> On Wed, Jun 03, 2015 at 11:25:09AM -0700, Alexander Duyck wrote:
>> On 06/02/2015 08:07 PM, Andy Gospodarek wrote:
> [...]
>>> Though there were some that preferred not to have a configuration option
>>> and to make this behavior the default when it was discussed in Ottawa
>>> earlier this year since "it was time to do this."  I wanted to propose
>>> the config option to preserve the current behavior for those that desire
>>> it.  I'll happily remove it if Dave and Linus approve.
>>>
>>> An IPv6 implementation is also needed (DECnet too!), but I wanted to
>>> start with the IPv4 implementation to get people comfortable with the
>>> idea before moving forward.  If this is accepted the IPv6 implementation
>>> can be posted shortly.
>>>
>>> FWIW, we have been running this patch with the sysctl setting above and
>>> our customers have been happily using a backported version for IPv4 a
>>> IPv6 for >6 months.
>> Really the patch below doesn't completely jive with what you have stated in
>> the patch description above.  I would have really much rather seen the
>> DEAD_LINKDOWN be the only behavior you changed.  Instead there are a number
>> of changes to the DEAD flag that I am not sure are really necessary.
> The main reason for the overload (which seems to be the source of most
> of your comments), was to allow routes to be reported back as dead to
> userspace with no modification to iproute2 and friends since there is
> already the ability to print that a route is dead and that implication
> is pretty clear.

The thing is in a way that is breaking user-space since dead now doesn't 
mean dead.  You have to try and decode it based on the flag value.

I'll add Stephen to the CC to get his thoughts on this.

> The other thing I wanted to avoid is to have users update their kernel
> (and the sysctl option) while not updating userspace and then have no
> real clue why routes are not being selected since their iproute2 output
> would not look any different.  It seems preservation of userspace seems
> paramount, even in the face of a different config of the kernel.

Well the other bit in this is that I am not sure how your change behaves 
in the face of something like tie CONFIG_IP_MULTIPLE_TABLES not being 
defined.  For example __fib_lookup is defined in fib_rules and most of 
that code goes unused when advanced router is disabled.

[...]
>>> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
>>> index 95b6139..fef1804 100644
>>> --- a/net/core/sysctl_net_core.c
>>> +++ b/net/core/sysctl_net_core.c
>>> @@ -392,6 +392,13 @@ static struct ctl_table net_core_table[] = {
>>>   		.mode		= 0644,
>>>   		.proc_handler	= proc_dointvec
>>>   	},
>>> +	{
>>> +		.procname	= "kill_routes_on_linkdown",
>>> +		.data		= &kill_routes_on_linkdown,
>>> +		.maxlen		= sizeof(int),
>>> +		.mode		= 0644,
>>> +		.proc_handler	= proc_dointvec
>>> +	},
>>>   	{ }
>>>   };
>>>
>>> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
>>> index 872494e..94348c7 100644
>>> --- a/net/ipv4/fib_frontend.c
>>> +++ b/net/ipv4/fib_frontend.c
>>> @@ -1107,6 +1107,7 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
>>>   	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>>>   	struct in_device *in_dev;
>>>   	struct net *net = dev_net(dev);
>>> +	unsigned flags;
>>>
>>>   	if (event == NETDEV_UNREGISTER) {
>>>   		fib_disable_ip(dev, 2);
>>> @@ -1130,10 +1131,17 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
>>>   		rt_cache_flush(net);
>>>   		break;
>>>   	case NETDEV_DOWN:
>>> -		fib_disable_ip(dev, 0);
>>> +		fib_disable_ip(dev, 1);
>>>   		break;
>>> -	case NETDEV_CHANGEMTU:
>>>   	case NETDEV_CHANGE:
>>> +		if (kill_routes_on_linkdown) {
>>> +			flags = dev_get_flags(dev);
>>> +			if (flags & (IFF_RUNNING|IFF_LOWER_UP))
>>> +				fib_sync_up(dev);
>>> +			else
>>> +				fib_sync_down_dev(dev, 0);
>>> +		}
>>> +	case NETDEV_CHANGEMTU:
>>>   		rt_cache_flush(net);
>>>   		break;
>>>   	}
>> I thought the value of 1 is already being used in fib_inetaddr_event.
> You are correct.  I'll need to check the behavior to make sure this does
> not break that (or put a check for kill_routes_on_linkdown) here
> instead.
>
>>> diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
>>> index 5615198..d135ec9 100644
>>> --- a/net/ipv4/fib_rules.c
>>> +++ b/net/ipv4/fib_rules.c
>>> @@ -49,9 +49,14 @@ struct fib4_rule {
>>>
>>>   int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res)
>>>   {
>>> +	return fib_lookup_flags(net, flp, res, 0);
>>> +}
>>> +
>>> +int fib_lookup_flags(struct net *net, struct flowi4 *flp, struct fib_result *res, int flags)
>>> +{
>>>   	struct fib_lookup_arg arg = {
>>>   		.result = res,
>>> -		.flags = FIB_LOOKUP_NOREF,
>>> +		.flags = FIB_LOOKUP_NOREF | flags,
>>>   	};
>>>   	int err;
>>>
>> You would probably be better off forking this function so that you have a
>> version that can lookup dead routes.  Since this is hot-path we don't wan to
>> have to split this up over too many functions.

Actually there is another issue here.  This function is only called from 
fib_lookup if custom rules are enabled.  It seems like without custom 
rules enabled you would just be calling fib_table_lookup and in that 
case you cannot ever pass the flag through.

>>
>>> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
>>> index 28ec3c1..c0874ee 100644
>>> --- a/net/ipv4/fib_semantics.c
>>> +++ b/net/ipv4/fib_semantics.c
>>> @@ -266,7 +266,7 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
>>>   #ifdef CONFIG_IP_ROUTE_CLASSID
>>>   		    nh->nh_tclassid != onh->nh_tclassid ||
>>>   #endif
>>> -		    ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD))
>>> +		    ((nh->nh_flags ^ onh->nh_flags) & ~(RTNH_F_DEAD|RTNH_F_DEAD_LINKDOWN)))
>>>   			return -1;
>>>   		onh++;
>>>   	} endfor_nexthops(fi);
>> You might just want to come up with a define to cover both dead cases. Maybe
>> something like "RTNH_F_DEAD_ANY"
>>
>>> @@ -318,7 +318,7 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi)
>>>   		    nfi->fib_type == fi->fib_type &&
>>>   		    memcmp(nfi->fib_metrics, fi->fib_metrics,
>>>   			   sizeof(u32) * RTAX_MAX) == 0 &&
>>> -		    ((nfi->fib_flags ^ fi->fib_flags) & ~RTNH_F_DEAD) == 0 &&
>>> +		    ((nfi->fib_flags ^ fi->fib_flags) & ~(RTNH_F_DEAD|RTNH_F_DEAD_LINKDOWN)) == 0 &&
>>>   		    (nfi->fib_nhs == 0 || nh_comp(fi, nfi) == 0))
>>>   			return fi;
>>>   	}
>>> @@ -604,6 +604,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>>>   				return -ENODEV;
>>>   			if (!(dev->flags & IFF_UP))
>>>   				return -ENETDOWN;
>>> +			if (!netif_carrier_ok(dev) && kill_routes_on_linkdown)
>>> +				nh->nh_flags |= RTNH_F_DEAD;
>>>   			nh->nh_dev = dev;
>>>   			dev_hold(dev);
>>>   			nh->nh_scope = RT_SCOPE_LINK;
>> It seems like you really should be using the DEAD_LINKDOWN here since this
>> is what you are trying to indicate.
>>
>> Also it really seems like you should move kill_routes_on_linkdown to
>> fib_table_lookup.  You should track the extra flag always, and only skip the
>> routes if kill_routes_on_linkdown is set.  Then you can override that with
>> the extra flag being passed to fib_table_lookup.
>>
>> Actually now that I think about it you could probably use
>> kill_routes_on_linkdown in __fib_lookup to set a flag indicating that you
>> want to kill the routes, otherwise just use the default behavior.
> This is a significantly different design, but I like this.  One of the
> issues I had with this was the fact that it felt like there were lots of
> checks for kill_routes_on_linkdown and this would eliminate the need to
> have this check in many places.
>
> I can play with the following if it seems compelling (and it seems that
> this feature is desireable so I'm willing to give it a try):
>
> - set RTNH_F_LINKDOWN (dropping the 'DEAD' from the string) on all
>    routes with interfaces that have link-down
>
> - move check for kill_routes_on_linkdown to fib_lookup path and see if
>    I can differentiate the cases between when I'm trying to add a route
>    on an interface and doing a lookup
>
> - set RTNH_F_DEAD on all nexthops during a fib_dump_info() so older
>    userspace tools will still know about the dead routes

Yes, I think that implementation would be much cleaner.  I'm still not 
sure if you should be setting RTNH_F_DEAD if RTNH_F_LINKDOWN is set in 
fib_dump_info but that is really up for Stephen to deal with since it is 
his userspace that will have to deal with that.

>
>>> @@ -621,7 +623,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>>>   			/* It is not necessary, but requires a bit of thinking */
>>>   			if (fl4.flowi4_scope < RT_SCOPE_LINK)
>>>   				fl4.flowi4_scope = RT_SCOPE_LINK;
>>> -			err = fib_lookup(net, &fl4, &res);
>>> +			err = fib_lookup_flags(net, &fl4, &res, FIB_LOOKUP_ALLOWDEAD);
>>>   			if (err) {
>>>   				rcu_read_unlock();
>>>   				return err;

I just noticed this piece.  You replace fib_lookup with 
fib_lookup_flags.  I don't think the two are equivalent since what you 
modified was __fib_lookup.  Also there are two implementations of 
fib_lookup, one for the multiple tables and one for the single table 
implementation.  You need to make sure both work the same after your patch.

It might also be worthwhile to make it so that the lookup flag opts into 
dropping the down link instead of opting into it.  One way to approach 
this would be to make both fib_lookup and __fib_lookup accept a flags 
value.  Then you could OR in the FIB_LOOKUP_NOREF in fib_lookup, and 
assign the value directly to the arg.flags in __fib_lookup, and strip 
the FIB_LOOKUP_DROPDEAD (yes I am flipping the logic here so I renamed 
it) in fib_table_lookup if the sysctl indicates the option is disabled.  
In fact you could probably convert the bit stripping in fib_table_lookup 
to a static_key based flag similar to rps_needed.

You might also want to split this into two patches.  One for 
setting/clearing the LINKDOWN bit, and one for all of the fib_lookup 
changes.

>>> @@ -636,6 +638,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>>>   		if (!dev)
>>>   			goto out;
>>>   		dev_hold(dev);
>>> +		if (!netif_carrier_ok(dev) && kill_routes_on_linkdown)
>>> +			nh->nh_flags |= RTNH_F_DEAD;
>>>   		err = (dev->flags & IFF_UP) ? 0 : -ENETDOWN;
>>>   	} else {
>>>   		struct in_device *in_dev;
>> Same here.  There is no point in overloading RTNH_F_DEAD if you already are
>> adding a new bit that will represent something similar.
> The design decision for this mentioned above should probably make the
> decision at least clear -- maybe not correct, but it should make more
> sense.  :-)

Agreed.

>
>>> @@ -654,6 +658,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>>>   		nh->nh_dev = in_dev->dev;
>>>   		dev_hold(nh->nh_dev);
>>>   		nh->nh_scope = RT_SCOPE_HOST;
>>> +		if (!netif_carrier_ok(nh->nh_dev) && kill_routes_on_linkdown)
>>> +			nh->nh_flags |= RTNH_F_DEAD;
>>>   		err = 0;
>>>   	}
>>>   out:
>>> @@ -760,6 +766,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>>>   	struct fib_info *ofi;
>>>   	int nhs = 1;
>>>   	struct net *net = cfg->fc_nlinfo.nl_net;
>>> +	int dead;
>>>
>>>   	if (cfg->fc_type > RTN_MAX)
>>>   		goto err_inval;
>>> @@ -920,11 +927,18 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>>>   		if (!nh->nh_dev)
>>>   			goto failure;
>>>   	} else {
>>> +		dead = 0;
>>>   		change_nexthops(fi) {
>>>   			err = fib_check_nh(cfg, fi, nexthop_nh);
>>>   			if (err != 0)
>>>   				goto failure;
>>> +			if (nexthop_nh->nh_flags & RTNH_F_DEAD)
>>> +				dead++;
>>>   		} endfor_nexthops(fi)
>>> +		if ((dead == fi->fib_nhs)) {
>>> +			fi->fib_flags |= RTNH_F_DEAD;
>>> +			fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
>>> +		}
>>>   	}
>>>
>>>   	if (fi->fib_prefsrc) {
>>> @@ -1097,6 +1111,8 @@ int fib_sync_down_addr(struct net *net, __be32 local)
>>>   			continue;
>>>   		if (fi->fib_prefsrc == local) {
>>>   			fi->fib_flags |= RTNH_F_DEAD;
>>> +			/* Addr is gone, more serious than a linkdown */
>>> +			fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
>>>   			ret++;
>>>   		}
>>>   	}
>>> @@ -1112,7 +1128,7 @@ int fib_sync_down_dev(struct net_device *dev, int force)
>>>   	struct hlist_head *head = &fib_info_devhash[hash];
>>>   	struct fib_nh *nh;
>>>
>>> -	if (force)
>>> +	if (force > 1)
>>>   		scope = -1;
>>>
>>>   	hlist_for_each_entry(nh, head, nh_hash) {
>>> @@ -1147,6 +1163,11 @@ int fib_sync_down_dev(struct net_device *dev, int force)
>>>   		} endfor_nexthops(fi)
>>>   		if (dead == fi->fib_nhs) {
>>>   			fi->fib_flags |= RTNH_F_DEAD;
>>> +			/* force marks route down due to admin down and device removal. */
>>> +			if (!force)
>>> +				fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
>>> +			else
>>> +				fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
>>>   			ret++;
>>>   		}
>>>   	}
>> So what is the idea behind changing the force value like this?  It seems
>> like it would make this much more readable if you were to simply use a value
>> such as -1 to handle your DEAD_LINKDOWN case rather than altering the
>> behavior for the values 1 and 0.
> Also a reasonable option.  The key is that I want to be able to
> differentiate from a DEAD link that is due to a linkdown and DEAD due to
> clearing of IFF_UP.

Yes, and I agree that if the bits are separated so that LINKDOWN is just 
mapped to DEAD on fib_dump_info then it makes it must easier to read.

>
>>> @@ -1223,10 +1244,12 @@ int fib_sync_up(struct net_device *dev)
>>>   	struct hlist_head *head;
>>>   	struct fib_nh *nh;
>>>   	int ret;
>>> +	int link_up;
>>>
>>>   	if (!(dev->flags & IFF_UP))
>>>   		return 0;
>>>
>>> +	link_up = netif_carrier_ok(dev);
>>>   	prev_fi = NULL;
>>>   	hash = fib_devindex_hashfn(dev->ifindex);
>>>   	head = &fib_info_devhash[hash];
>>> @@ -1253,16 +1276,27 @@ int fib_sync_up(struct net_device *dev)
>>>   			if (nexthop_nh->nh_dev != dev ||
>>>   			    !__in_dev_get_rtnl(dev))
>>>   				continue;
>>> -			alive++;
>>> +			if (link_up) {
>>> +				/* Link is up, so mark NH as alive */
>>> +				nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
>>> +				alive++;
>>> +			} else
>>> +				nexthop_nh->nh_flags |= RTNH_F_DEAD;
>>> +
>>>   			spin_lock_bh(&fib_multipath_lock);
>>>   			nexthop_nh->nh_power = 0;
>>> -			nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
>>>   			spin_unlock_bh(&fib_multipath_lock);
>>>   		} endfor_nexthops(fi)
>>>
>>>   		if (alive > 0) {
>>> +			/* Some NHs are alive, unmark the route as dead */
>>> +			fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
>>>   			fi->fib_flags &= ~RTNH_F_DEAD;
>>>   			ret++;
>>> +		} else {
>>> +			/* No NHs are alive, mark the route as dead */
>>> +			fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
>>> +			fi->fib_flags |= RTNH_F_DEAD;
>>>   		}
>>>   	}
>>>
>>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>>> index 01bce15..eedf287 100644
>>> --- a/net/ipv4/fib_trie.c
>>> +++ b/net/ipv4/fib_trie.c
>>> @@ -1401,12 +1401,18 @@ found:
>>>   #endif
>>>   			return err;
>>>   		}
>>> -		if (fi->fib_flags & RTNH_F_DEAD)
>>> -			continue;
>>> +		if (!(fib_flags & FIB_LOOKUP_ALLOWDEAD)) {
>>> +			/* if route is dead and link is down, keep looking  */
>>> +			if ((fi->fib_flags & RTNH_F_DEAD) &&
>>> +			    (fi->fib_flags & RTNH_F_DEAD_LINKDOWN))
>>> +				continue;
>>> +		}
>> The overloading of things makes this bit confusing.  I would much rather
>> have seen this as:
>> 		if (fi->fib_flags & RTNH_F_DEAD)
>> 			continue;
>> 		if ((fi->fib_flags & RTNH_F_DEAD_LINKDOWN) &&
>> 		    !(fib_flags & FIB_LOOKUP_ALLOWDEAD))
>> 			continue;
>>
>>>   		for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
>>>   			const struct fib_nh *nh = &fi->fib_nh[nhsel];
>>>
>>> -			if (nh->nh_flags & RTNH_F_DEAD)
>>> +			/* allow next-hop to be added if link is down */
>>> +			if ((nh->nh_flags & RTNH_F_DEAD) &&
>>> +			    !(fi->fib_flags & RTNH_F_DEAD_LINKDOWN))
>>>   				continue;
>>>   			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
>>>   				continue;
>> Same here.  By combining the DEAD and DEAD_LINKDOWN the way they have been
>> it is hard to tell exaclty what is going on.  It would be much easier to
>> sort all of this out if DEAD was left as DEAD, and DEAD_LINKDOWN was handled
>> as something similar with option to override.
> I think this will be fine.  Clearly I would want to test it again first,
> but I'd have no major issue with a cleanup like this.
>
>>> @@ -1829,7 +1835,12 @@ int fib_table_flush(struct fib_table *tb)
>>>   		hlist_for_each_entry_safe(fa, tmp, &n->leaf, fa_list) {
>>>   			struct fib_info *fi = fa->fa_info;
>>>
>>> -			if (!fi || !(fi->fib_flags & RTNH_F_DEAD)) {
>>> +			/* DEAD and DEAD_LINKDOWN will not both be set
>>> +			 * with IFF_UP is cleared, so do not flush
>>> +			 * entries with only DEAD set
>>> +			 */
>>> +			if (!fi || !((fi->fib_flags & RTNH_F_DEAD) &&
>>> +			    !(fi->fib_flags & RTNH_F_DEAD_LINKDOWN))) {
>>>   				slen = fa->fa_slen;
>>>   				continue;
>>>   			}
>>>
>> First you could do this flags check with just one and:
>> 	if (!fi || !(fi->fib_flags & (RTNH_F_DEAD |
>> 				      RTNH_F_DEAD_LINKDOWN)) {
>>
>> Second I am not a fan of this flag messing with stuff such as the suffix
>> length since we are now cut out of the search for just a link down event and
>> the fact is link down/link up events can occur quickly and in significant
>> quantaties int he case of a link bouncing.
>>
>> Really what I would prefer to see is that this route gets ignored in
>> fib_table_lookup in the case of a link down, and then we can avoid all of
>> the ugly messing with RTNH_F_DEAD that seems to be happening as it makes it
>> much more difficult to decode the actual state with the two flags.
> That would be fine, but then a new way needs to be devised to report
> this back to userspace to anyone who performs a dump of the fib table
> knows that routes are dead.  One such option would be to have a check
> in fib_dump_info() for link status and add the RTNH_F_DEAD flag (or
> another flag if it is deemed that this is necessary and not too
> disruptive to userspace on the way up.

I'd leave this up to Stephen.  Actually implementing that bit should be 
straight forward since it is just a shift, AND, OR to get the DEAD bit 
to mirror the LINKDOWN bit.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Gospodarek June 5, 2015, 7:05 p.m. UTC | #30
On Wed, Jun 03, 2015 at 02:01:24PM -0700, Alexander Duyck wrote:
> On 06/03/2015 01:02 PM, Andy Gospodarek wrote:
[...]
> >
> >>>@@ -621,7 +623,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
> >>>  			/* It is not necessary, but requires a bit of thinking */
> >>>  			if (fl4.flowi4_scope < RT_SCOPE_LINK)
> >>>  				fl4.flowi4_scope = RT_SCOPE_LINK;
> >>>-			err = fib_lookup(net, &fl4, &res);
> >>>+			err = fib_lookup_flags(net, &fl4, &res, FIB_LOOKUP_ALLOWDEAD);
> >>>  			if (err) {
> >>>  				rcu_read_unlock();
> >>>  				return err;
> 
> I just noticed this piece.  You replace fib_lookup with fib_lookup_flags.  I
> don't think the two are equivalent since what you modified was __fib_lookup.
> Also there are two implementations of fib_lookup, one for the multiple
> tables and one for the single table implementation.  You need to make sure
> both work the same after your patch.
> 
> It might also be worthwhile to make it so that the lookup flag opts into
> dropping the down link instead of opting into it.  One way to approach this
> would be to make both fib_lookup and __fib_lookup accept a flags value.
> Then you could OR in the FIB_LOOKUP_NOREF in fib_lookup, and assign the
> value directly to the arg.flags in __fib_lookup, and strip the
> FIB_LOOKUP_DROPDEAD (yes I am flipping the logic here so I renamed it) in
> fib_table_lookup if the sysctl indicates the option is disabled.  In fact
> you could probably convert the bit stripping in fib_table_lookup to a
> static_key based flag similar to rps_needed.
> 
> You might also want to split this into two patches.  One for
> setting/clearing the LINKDOWN bit, and one for all of the fib_lookup
> changes.

So I think I've addressed most of your concerns (as well as Scott's
concern about a per-device configuration), and have it ready to go.  I've
also cleaned up other items that had odd/confusing logic so they will
hopefully be easier to maintain.

The switchdev bits are not done, but that should be terrible -- despite
the fact that right now no NH flag changes made inside the kernel are
communicated to offload devices.

[...]
> >That would be fine, but then a new way needs to be devised to report
> >this back to userspace to anyone who performs a dump of the fib table
> >knows that routes are dead.  One such option would be to have a check
> >in fib_dump_info() for link status and add the RTNH_F_DEAD flag (or
> >another flag if it is deemed that this is necessary and not too
> >disruptive to userspace on the way up.
> 
> I'd leave this up to Stephen.  Actually implementing that bit should be
> straight forward since it is just a shift, AND, OR to get the DEAD bit to
> mirror the LINKDOWN bit.

I've still not heard anything from Stephen and I'd be curious if he
wants to weigh in on this....

My preference for user output is for the normal output to users who have
not set any of these sysctl bits to see this output with iproute2 when
p8p1 is linkdown:

# ip route show
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown 
90.0.0.0/24 
	nexthop via 80.0.0.2  dev p8p1 weight 1 linkdown
	nexthop via 70.0.0.2  dev p7p1 weight 1

This would simply indicate state, it does not take an action based on
that flag.  If the sysctl is set we would see the following:

# ip route show
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown 
90.0.0.0/24 
	nexthop via 80.0.0.2  dev p8p1 weight 1 dead linkdown
	nexthop via 70.0.0.2  dev p7p1 weight 1

This allows someone who is only querying via netlink to know that the
route would not be used by the kernel due to the link being down (bits
for DEAD & LINKDOWN set).  This is separate from the nexthop not being
used as the interface is administratively down (bits for DEAD set).

FWIW, the kernel will not set these bits at the same time.  There really
is not a need since we are keying off the sysctl and LINKDOWN and the
logic is much cleaner and easier to understand that way.  This would be
similar to the way dev_get_flags adds bits to GETLINK callers without
having the kernel store all of those values.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6f5f71f..5bd953c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2986,6 +2986,7 @@  int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 bool is_skb_forwardable(struct net_device *dev, struct sk_buff *skb);
 
 extern int		netdev_budget;
+extern int		kill_routes_on_linkdown;
 
 /* Called by rtnetlink.c:rtnl_unlock() */
 void netdev_run_todo(void);
diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 6d67383..4fbfda5 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -37,6 +37,7 @@  struct fib_lookup_arg {
 	struct fib_rule		*rule;
 	int			flags;
 #define FIB_LOOKUP_NOREF	1
+#define FIB_LOOKUP_ALLOWDEAD	2
 };
 
 struct fib_rules_ops {
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 54271ed..efb195b 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -250,6 +250,7 @@  struct fib_table *fib_new_table(struct net *net, u32 id);
 struct fib_table *fib_get_table(struct net *net, u32 id);
 
 int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res);
+int fib_lookup_flags(struct net *n, struct flowi4 *flp, struct fib_result *res, int flags);
 
 static inline int fib_lookup(struct net *net, struct flowi4 *flp,
 			     struct fib_result *res)
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 17fb02f..bc264d4 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -338,6 +338,7 @@  struct rtnexthop {
 #define RTNH_F_PERVASIVE	2	/* Do recursive gateway lookup	*/
 #define RTNH_F_ONLINK		4	/* Gateway is forced on link	*/
 #define RTNH_F_OFFLOAD		8	/* offloaded route */
+#define RTNH_F_DEAD_LINKDOWN	16	/* Route dead due to carrier down */
 
 /* Macros to handle hexthops */
 
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 0956373..b6632dc 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -276,6 +276,7 @@  enum
 	NET_CORE_AEVENT_ETIME=20,
 	NET_CORE_AEVENT_RSEQTH=21,
 	NET_CORE_WARNINGS=22,
+	NET_CORE_KILL_ROUTES_ON_LINKDOWN=23,
 };
 
 /* /proc/sys/net/ethernet */
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 7e7746a..f319116 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -196,6 +196,7 @@  static const struct bin_table bin_net_core_table[] = {
 	{ CTL_INT,	NET_CORE_AEVENT_ETIME,	"xfrm_aevent_etime" },
 	{ CTL_INT,	NET_CORE_AEVENT_RSEQTH,	"xfrm_aevent_rseqth" },
 	{ CTL_INT,	NET_CORE_WARNINGS,	"warnings" },
+	{ CTL_INT,	NET_CORE_KILL_ROUTES_ON_LINKDOWN,	"kill_routes_on_linkdown" },
 	{},
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 0602e91..50dc19c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3167,6 +3167,8 @@  EXPORT_SYMBOL(netdev_max_backlog);
 int netdev_tstamp_prequeue __read_mostly = 1;
 int netdev_budget __read_mostly = 300;
 int weight_p __read_mostly = 64;            /* old backlog weight */
+int kill_routes_on_linkdown = 0;
+EXPORT_SYMBOL(kill_routes_on_linkdown);
 
 /* Called with irq disabled */
 static inline void ____napi_schedule(struct softnet_data *sd,
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 95b6139..fef1804 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -392,6 +392,13 @@  static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "kill_routes_on_linkdown",
+		.data		= &kill_routes_on_linkdown,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 	{ }
 };
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 872494e..94348c7 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1107,6 +1107,7 @@  static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
 	struct in_device *in_dev;
 	struct net *net = dev_net(dev);
+	unsigned flags;
 
 	if (event == NETDEV_UNREGISTER) {
 		fib_disable_ip(dev, 2);
@@ -1130,10 +1131,17 @@  static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 		rt_cache_flush(net);
 		break;
 	case NETDEV_DOWN:
-		fib_disable_ip(dev, 0);
+		fib_disable_ip(dev, 1);
 		break;
-	case NETDEV_CHANGEMTU:
 	case NETDEV_CHANGE:
+		if (kill_routes_on_linkdown) {
+			flags = dev_get_flags(dev);
+			if (flags & (IFF_RUNNING|IFF_LOWER_UP))
+				fib_sync_up(dev);
+			else
+				fib_sync_down_dev(dev, 0);
+		}
+	case NETDEV_CHANGEMTU:
 		rt_cache_flush(net);
 		break;
 	}
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 5615198..d135ec9 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -49,9 +49,14 @@  struct fib4_rule {
 
 int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res)
 {
+	return fib_lookup_flags(net, flp, res, 0);
+}
+
+int fib_lookup_flags(struct net *net, struct flowi4 *flp, struct fib_result *res, int flags)
+{
 	struct fib_lookup_arg arg = {
 		.result = res,
-		.flags = FIB_LOOKUP_NOREF,
+		.flags = FIB_LOOKUP_NOREF | flags,
 	};
 	int err;
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 28ec3c1..c0874ee 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -266,7 +266,7 @@  static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		    nh->nh_tclassid != onh->nh_tclassid ||
 #endif
-		    ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD))
+		    ((nh->nh_flags ^ onh->nh_flags) & ~(RTNH_F_DEAD|RTNH_F_DEAD_LINKDOWN)))
 			return -1;
 		onh++;
 	} endfor_nexthops(fi);
@@ -318,7 +318,7 @@  static struct fib_info *fib_find_info(const struct fib_info *nfi)
 		    nfi->fib_type == fi->fib_type &&
 		    memcmp(nfi->fib_metrics, fi->fib_metrics,
 			   sizeof(u32) * RTAX_MAX) == 0 &&
-		    ((nfi->fib_flags ^ fi->fib_flags) & ~RTNH_F_DEAD) == 0 &&
+		    ((nfi->fib_flags ^ fi->fib_flags) & ~(RTNH_F_DEAD|RTNH_F_DEAD_LINKDOWN)) == 0 &&
 		    (nfi->fib_nhs == 0 || nh_comp(fi, nfi) == 0))
 			return fi;
 	}
@@ -604,6 +604,8 @@  static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 				return -ENODEV;
 			if (!(dev->flags & IFF_UP))
 				return -ENETDOWN;
+			if (!netif_carrier_ok(dev) && kill_routes_on_linkdown)
+				nh->nh_flags |= RTNH_F_DEAD;
 			nh->nh_dev = dev;
 			dev_hold(dev);
 			nh->nh_scope = RT_SCOPE_LINK;
@@ -621,7 +623,7 @@  static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			/* It is not necessary, but requires a bit of thinking */
 			if (fl4.flowi4_scope < RT_SCOPE_LINK)
 				fl4.flowi4_scope = RT_SCOPE_LINK;
-			err = fib_lookup(net, &fl4, &res);
+			err = fib_lookup_flags(net, &fl4, &res, FIB_LOOKUP_ALLOWDEAD);
 			if (err) {
 				rcu_read_unlock();
 				return err;
@@ -636,6 +638,8 @@  static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 		if (!dev)
 			goto out;
 		dev_hold(dev);
+		if (!netif_carrier_ok(dev) && kill_routes_on_linkdown)
+			nh->nh_flags |= RTNH_F_DEAD;
 		err = (dev->flags & IFF_UP) ? 0 : -ENETDOWN;
 	} else {
 		struct in_device *in_dev;
@@ -654,6 +658,8 @@  static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 		nh->nh_dev = in_dev->dev;
 		dev_hold(nh->nh_dev);
 		nh->nh_scope = RT_SCOPE_HOST;
+		if (!netif_carrier_ok(nh->nh_dev) && kill_routes_on_linkdown)
+			nh->nh_flags |= RTNH_F_DEAD;
 		err = 0;
 	}
 out:
@@ -760,6 +766,7 @@  struct fib_info *fib_create_info(struct fib_config *cfg)
 	struct fib_info *ofi;
 	int nhs = 1;
 	struct net *net = cfg->fc_nlinfo.nl_net;
+	int dead;
 
 	if (cfg->fc_type > RTN_MAX)
 		goto err_inval;
@@ -920,11 +927,18 @@  struct fib_info *fib_create_info(struct fib_config *cfg)
 		if (!nh->nh_dev)
 			goto failure;
 	} else {
+		dead = 0;
 		change_nexthops(fi) {
 			err = fib_check_nh(cfg, fi, nexthop_nh);
 			if (err != 0)
 				goto failure;
+			if (nexthop_nh->nh_flags & RTNH_F_DEAD)
+				dead++;
 		} endfor_nexthops(fi)
+		if ((dead == fi->fib_nhs)) {
+			fi->fib_flags |= RTNH_F_DEAD;
+			fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
+		}
 	}
 
 	if (fi->fib_prefsrc) {
@@ -1097,6 +1111,8 @@  int fib_sync_down_addr(struct net *net, __be32 local)
 			continue;
 		if (fi->fib_prefsrc == local) {
 			fi->fib_flags |= RTNH_F_DEAD;
+			/* Addr is gone, more serious than a linkdown */
+			fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
 			ret++;
 		}
 	}
@@ -1112,7 +1128,7 @@  int fib_sync_down_dev(struct net_device *dev, int force)
 	struct hlist_head *head = &fib_info_devhash[hash];
 	struct fib_nh *nh;
 
-	if (force)
+	if (force > 1)
 		scope = -1;
 
 	hlist_for_each_entry(nh, head, nh_hash) {
@@ -1147,6 +1163,11 @@  int fib_sync_down_dev(struct net_device *dev, int force)
 		} endfor_nexthops(fi)
 		if (dead == fi->fib_nhs) {
 			fi->fib_flags |= RTNH_F_DEAD;
+			/* force marks route down due to admin down and device removal. */
+			if (!force)
+				fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
+			else
+				fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
 			ret++;
 		}
 	}
@@ -1223,10 +1244,12 @@  int fib_sync_up(struct net_device *dev)
 	struct hlist_head *head;
 	struct fib_nh *nh;
 	int ret;
+	int link_up;
 
 	if (!(dev->flags & IFF_UP))
 		return 0;
 
+	link_up = netif_carrier_ok(dev);
 	prev_fi = NULL;
 	hash = fib_devindex_hashfn(dev->ifindex);
 	head = &fib_info_devhash[hash];
@@ -1253,16 +1276,27 @@  int fib_sync_up(struct net_device *dev)
 			if (nexthop_nh->nh_dev != dev ||
 			    !__in_dev_get_rtnl(dev))
 				continue;
-			alive++;
+			if (link_up) {
+				/* Link is up, so mark NH as alive */
+				nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
+				alive++;
+			} else
+				nexthop_nh->nh_flags |= RTNH_F_DEAD;
+
 			spin_lock_bh(&fib_multipath_lock);
 			nexthop_nh->nh_power = 0;
-			nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
 			spin_unlock_bh(&fib_multipath_lock);
 		} endfor_nexthops(fi)
 
 		if (alive > 0) {
+			/* Some NHs are alive, unmark the route as dead */
+			fi->fib_flags &= ~RTNH_F_DEAD_LINKDOWN;
 			fi->fib_flags &= ~RTNH_F_DEAD;
 			ret++;
+		} else {
+			/* No NHs are alive, mark the route as dead */
+			fi->fib_flags |= RTNH_F_DEAD_LINKDOWN;
+			fi->fib_flags |= RTNH_F_DEAD;
 		}
 	}
 
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 01bce15..eedf287 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1401,12 +1401,18 @@  found:
 #endif
 			return err;
 		}
-		if (fi->fib_flags & RTNH_F_DEAD)
-			continue;
+		if (!(fib_flags & FIB_LOOKUP_ALLOWDEAD)) {
+			/* if route is dead and link is down, keep looking  */
+			if ((fi->fib_flags & RTNH_F_DEAD) &&
+			    (fi->fib_flags & RTNH_F_DEAD_LINKDOWN))
+				continue;
+		}
 		for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
 			const struct fib_nh *nh = &fi->fib_nh[nhsel];
 
-			if (nh->nh_flags & RTNH_F_DEAD)
+			/* allow next-hop to be added if link is down */
+			if ((nh->nh_flags & RTNH_F_DEAD) &&
+			    !(fi->fib_flags & RTNH_F_DEAD_LINKDOWN))
 				continue;
 			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
 				continue;
@@ -1829,7 +1835,12 @@  int fib_table_flush(struct fib_table *tb)
 		hlist_for_each_entry_safe(fa, tmp, &n->leaf, fa_list) {
 			struct fib_info *fi = fa->fa_info;
 
-			if (!fi || !(fi->fib_flags & RTNH_F_DEAD)) {
+			/* DEAD and DEAD_LINKDOWN will not both be set
+			 * with IFF_UP is cleared, so do not flush
+			 * entries with only DEAD set
+			 */
+			if (!fi || !((fi->fib_flags & RTNH_F_DEAD) &&
+			    !(fi->fib_flags & RTNH_F_DEAD_LINKDOWN))) {
 				slen = fa->fa_slen;
 				continue;
 			}