diff mbox series

[net,v3,1/2] ipv6: Dump route exceptions too in rt6_dump_route()

Message ID f5ca22e91017e90842ee00aa4fd41dcdf7a6e99b.1560016091.git.sbrivio@redhat.com
State Changes Requested
Delegated to: David Miller
Headers show
Series ipv6: Fix listing and flushing of cached route exceptions | expand

Commit Message

Stefano Brivio June 8, 2019, 6:12 p.m. UTC
Since commit 2b760fcf5cfb ("ipv6: hook up exception table to store dst
cache"), route exceptions reside in a separate hash table, and won't be
found by walking the FIB, so they won't be dumped to userspace on a
RTM_GETROUTE message.

This causes 'ip -6 route list cache' and 'ip -6 route flush cache' to
have no function anymore:

 # ip -6 route get fc00:3::1
 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 539sec mtu 1400 pref medium
 # ip -6 route get fc00:4::1
 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 536sec mtu 1500 pref medium
 # ip -6 route list cache
 # ip -6 route flush cache
 # ip -6 route get fc00:3::1
 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 520sec mtu 1400 pref medium
 # ip -6 route get fc00:4::1
 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 519sec mtu 1500 pref medium

because iproute2 lists cached routes using RTM_GETROUTE, and flushes them
by listing all the routes, and deleting them with RTM_DELROUTE one by one.

Look up exceptions in the hash table associated with the current fib6_info
in rt6_dump_route(), and, if present and not expired, add them to the
dump.

We might be unable to dump all the entries for a given node in a single
message, so keep track of how many entries were handled for the current
node in fib6_walker, and skip that amount in case we start from the same
partially dumped node.

Re-allow userspace to get FIB results by passing the RTM_F_CLONED flag as
filter, by reverting commit 08e814c9e8eb ("net/ipv6: Bail early if user
only wants cloned entries"). Note that this is needed only for 'ip -6
route list cache': on a flush command, iproute2 doesn't pass RTM_F_CLONED.
Due to this inconsistency, we can't filter on RTM_F_CLONED, because a
flush command would not set the flag in the dump request and get no routes
to flush.

Also note that iproute2 will actually filter unwanted routes (i.e. print
exceptions iff the 'cache' specifier is given).

To avoid dumping exceptions if not requested, we can, in the future, add
support for NLM_F_MATCH as described by RFC 3549. This would also require
some changes in iproute2: whenever a 'cache' argument is given,
RTM_F_CLONED should be set in the dump request and, when filtering in the
kernel is desired, NLM_F_MATCH should be also passed. We can then signal
filtering with the NLM_F_DUMP_FILTERED whenever a NLM_F_MATCH flag caused
it.

To flush cached routes, a procfs entry could be introduced instead: that's
how it works for IPv4. We already have a rt6_flush_exception() function
ready to be wired to it. However, this would not solve the issue for
listing, and wouldn't fix the issue with current and previous versions of
iproute2.

v3:
  - more descriptive comment about expired exceptions in rt6_dump_route()
  - swap return values of rt6_dump_route() (suggested by Martin Lau)
  - don't zero skip_in_node in case we don't dump anything in a given pass
    (also suggested by Martin Lau)
  - remove check on RTM_F_CLONED altogether: in the current UAPI semantic,
    it's just a flag to indicate the route was cloned, not to filter on
    routes

v2: Add tracking of number of entries to be skipped in current node after
    a partial dump. As we restart from the same node, if not all the
    exceptions for a given node fit in a single message, the dump will
    not terminate, as suggested by Martin Lau. This is a concrete
    possibility, setting up a big number of exceptions for the same route
    actually causes the issue, suggested by David Ahern.

Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
This will cause a non-trivial conflict with commit cc5c073a693f
("ipv6: Move exception bucket to fib6_nh") on net-next. I can submit
an equivalent patch against net-next, if it helps.

 include/net/ip6_fib.h   |  1 +
 include/net/ip6_route.h |  2 +-
 net/ipv6/ip6_fib.c      | 21 ++++++++-----
 net/ipv6/route.c        | 67 ++++++++++++++++++++++++++++++++++++-----
 4 files changed, 76 insertions(+), 15 deletions(-)

Comments

David Ahern June 10, 2019, 9:31 p.m. UTC | #1
On 6/8/19 12:12 PM, Stefano Brivio wrote:
> To avoid dumping exceptions if not requested, we can, in the future, add
> support for NLM_F_MATCH as described by RFC 3549. This would also require
> some changes in iproute2: whenever a 'cache' argument is given,
> RTM_F_CLONED should be set in the dump request and, when filtering in the
> kernel is desired, NLM_F_MATCH should be also passed. We can then signal
> filtering with the NLM_F_DUMP_FILTERED whenever a NLM_F_MATCH flag caused
> it.

NLM_F_MATCH is set today. iproute2 for example uses NLM_F_DUMP for dump
requests and NLM_F_DUMP is defined as:

#define NLM_F_DUMP      (NLM_F_ROOT|NLM_F_MATCH)

further, the kernel already supports kernel side filtering now for
routes. See ip_valid_fib_dump_req.
Stefano Brivio June 10, 2019, 9:45 p.m. UTC | #2
On Mon, 10 Jun 2019 15:31:37 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 6/8/19 12:12 PM, Stefano Brivio wrote:
> > To avoid dumping exceptions if not requested, we can, in the future, add
> > support for NLM_F_MATCH as described by RFC 3549. This would also require
> > some changes in iproute2: whenever a 'cache' argument is given,
> > RTM_F_CLONED should be set in the dump request and, when filtering in the
> > kernel is desired, NLM_F_MATCH should be also passed. We can then signal
> > filtering with the NLM_F_DUMP_FILTERED whenever a NLM_F_MATCH flag caused
> > it.  
> 
> NLM_F_MATCH is set today. iproute2 for example uses NLM_F_DUMP for dump
> requests and NLM_F_DUMP is defined as:
> 
> #define NLM_F_DUMP      (NLM_F_ROOT|NLM_F_MATCH)
> 
> further, the kernel already supports kernel side filtering now for
> routes. See ip_valid_fib_dump_req.

Indeed, we don't have to add much: just make this work for IPv4 too,
honour NLM_F_MATCH, and skip filtering (further optimisation) on
NLM_F_DUMP_FILTERED in iproute2 (ip neigh already uses that).
David Ahern June 10, 2019, 9:47 p.m. UTC | #3
On 6/10/19 3:45 PM, Stefano Brivio wrote:
> Indeed, we don't have to add much: just make this work for IPv4 too,
> honour NLM_F_MATCH, and skip filtering (further optimisation) on
> NLM_F_DUMP_FILTERED in iproute2 (ip neigh already uses that).

you can't. Not all of iproute2's filter options are handled by the
kernel (and nor should they be).
Stefano Brivio June 10, 2019, 9:55 p.m. UTC | #4
On Mon, 10 Jun 2019 15:47:16 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 6/10/19 3:45 PM, Stefano Brivio wrote:
> > Indeed, we don't have to add much: just make this work for IPv4 too,
> > honour NLM_F_MATCH, and skip filtering (further optimisation) on
> > NLM_F_DUMP_FILTERED in iproute2 (ip neigh already uses that).  
> 
> you can't. Not all of iproute2's filter options are handled by the
> kernel (and nor should they be).

Right, of course. Discard that last part.
diff mbox series

Patch

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 855b352b660f..5909a9d8ff67 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -312,6 +312,7 @@  struct fib6_walker {
 	enum fib6_walk_state state;
 	unsigned int skip;
 	unsigned int count;
+	unsigned int skip_in_node;
 	int (*func)(struct fib6_walker *);
 	void *args;
 };
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 4790beaa86e0..b66c4aac56ab 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -178,7 +178,7 @@  struct rt6_rtnl_dump_arg {
 	struct fib_dump_filter filter;
 };
 
-int rt6_dump_route(struct fib6_info *f6i, void *p_arg);
+int rt6_dump_route(struct fib6_info *f6i, void *p_arg, unsigned int skip);
 void rt6_mtu_change(struct net_device *dev, unsigned int mtu);
 void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
 void rt6_clean_tohost(struct net *net, struct in6_addr *gateway);
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 9180c8b6f764..73dbad54baea 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -469,12 +469,19 @@  static int fib6_dump_node(struct fib6_walker *w)
 	struct fib6_info *rt;
 
 	for_each_fib6_walker_rt(w) {
-		res = rt6_dump_route(rt, w->args);
-		if (res < 0) {
+		res = rt6_dump_route(rt, w->args, w->skip_in_node);
+		if (res >= 0) {
 			/* Frame is full, suspend walking */
 			w->leaf = rt;
+
+			/* We'll restart from this node, so if some routes were
+			 * already dumped, skip them next time.
+			 */
+			w->skip_in_node += res;
+
 			return 1;
 		}
+		w->skip_in_node = 0;
 
 		/* Multipath routes are dumped in one route with the
 		 * RTA_MULTIPATH attribute. Jump 'rt' to point to the
@@ -526,6 +533,7 @@  static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 	if (cb->args[4] == 0) {
 		w->count = 0;
 		w->skip = 0;
+		w->skip_in_node = 0;
 
 		spin_lock_bh(&table->tb6_lock);
 		res = fib6_walk(net, w);
@@ -541,6 +549,7 @@  static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
 			w->state = FWS_INIT;
 			w->node = w->root;
 			w->skip = w->count;
+			w->skip_in_node = 0;
 		} else
 			w->skip = 0;
 
@@ -577,13 +586,10 @@  static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
 		struct rtmsg *rtm = nlmsg_data(nlh);
 
-		arg.filter.flags = rtm->rtm_flags & (RTM_F_PREFIX|RTM_F_CLONED);
+		if (rtm->rtm_flags & RTM_F_PREFIX)
+			arg.filter.flags = RTM_F_PREFIX;
 	}
 
-	/* fib entries are never clones */
-	if (arg.filter.flags & RTM_F_CLONED)
-		goto out;
-
 	w = (void *)cb->args[2];
 	if (!w) {
 		/* New dump:
@@ -2041,6 +2047,7 @@  static void fib6_clean_tree(struct net *net, struct fib6_node *root,
 	c.w.func = fib6_clean_node;
 	c.w.count = 0;
 	c.w.skip = 0;
+	c.w.skip_in_node = 0;
 	c.func = func;
 	c.sernum = sernum;
 	c.arg = arg;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0f60eb3a2873..45ada000a98e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4854,33 +4854,86 @@  static bool fib6_info_uses_dev(const struct fib6_info *f6i,
 	return false;
 }
 
-int rt6_dump_route(struct fib6_info *rt, void *p_arg)
+/* Return -1 if done with node, number of handled routes on partial dump */
+int rt6_dump_route(struct fib6_info *rt, void *p_arg, unsigned int skip)
 {
 	struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg;
 	struct fib_dump_filter *filter = &arg->filter;
+	struct rt6_exception_bucket *bucket;
 	unsigned int flags = NLM_F_MULTI;
+	struct rt6_exception *rt6_ex;
 	struct net *net = arg->net;
+	int i, count = 0;
 
 	if (rt == net->ipv6.fib6_null_entry)
-		return 0;
+		return -1;
 
 	if ((filter->flags & RTM_F_PREFIX) &&
 	    !(rt->fib6_flags & RTF_PREFIX_RT)) {
 		/* success since this is not a prefix route */
-		return 1;
+		return -1;
 	}
 	if (filter->filter_set) {
 		if ((filter->rt_type && rt->fib6_type != filter->rt_type) ||
 		    (filter->dev && !fib6_info_uses_dev(rt, filter->dev)) ||
 		    (filter->protocol && rt->fib6_protocol != filter->protocol)) {
-			return 1;
+			return -1;
 		}
 		flags |= NLM_F_DUMP_FILTERED;
 	}
 
-	return rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
-			     RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid,
-			     arg->cb->nlh->nlmsg_seq, flags);
+	if (skip) {
+		skip--;
+	} else {
+		if (rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
+				  RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid,
+				  arg->cb->nlh->nlmsg_seq, flags)) {
+			return 0;
+		}
+
+		count++;
+	}
+
+	bucket = rcu_dereference(rt->rt6i_exception_bucket);
+	if (!bucket)
+		return -1;
+
+	for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
+		hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) {
+			if (skip) {
+				skip--;
+				continue;
+			}
+
+			/* Expiration of entries doesn't bump sernum, insertion
+			 * does. Removal is triggered by insertion, so we can
+			 * rely on the fact that if entries change between two
+			 * partial dumps, this node is scanned again completely,
+			 * see rt6_insert_exception() and fib6_dump_table().
+			 *
+			 * Count expired entries we go through as handled
+			 * entries that we'll skip next time, in case of partial
+			 * node dump. Otherwise, if entries expire meanwhile,
+			 * we'll skip the wrong amount.
+			 */
+			if (rt6_check_expired(rt6_ex->rt6i)) {
+				count++;
+				continue;
+			}
+
+			if (rt6_fill_node(net, arg->skb, rt, &rt6_ex->rt6i->dst,
+					  NULL, NULL, 0, RTM_NEWROUTE,
+					  NETLINK_CB(arg->cb->skb).portid,
+					  arg->cb->nlh->nlmsg_seq, flags)) {
+				return count;
+			}
+
+			count++;
+		}
+		bucket++;
+	}
+
+	return -1;
 }
 
 static int inet6_rtm_valid_getroute_req(struct sk_buff *skb,