diff mbox series

[net,1/2] ipv6: Dump route exceptions too in rt6_dump_route()

Message ID 085ce9fbe0206be0d1d090b36e656aa89cef3d98.1559851514.git.sbrivio@redhat.com
State Superseded
Delegated to: David Miller
Headers show
Series ipv6: Fix listing and flushing of cached route exceptions | expand

Commit Message

Stefano Brivio June 6, 2019, 8:13 p.m. UTC
Since commit 2b760fcf5cfb ("ipv6: hook up exception table to store dst
cache"), route exceptions reside in a separate hash table, and won't be
found by walking the FIB, so they won't be dumped to userspace on a
RTM_GETROUTE message.

This causes 'ip -6 route list cache' and 'ip -6 route flush cache' to
have no function anymore:

 # ip -6 route get fc00:3::1
 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 539sec mtu 1400 pref medium
 # ip -6 route get fc00:4::1
 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 536sec mtu 1500 pref medium
 # ip -6 route list cache
 # ip -6 route flush cache
 # ip -6 route get fc00:3::1
 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 520sec mtu 1400 pref medium
 # ip -6 route get fc00:4::1
 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 519sec mtu 1500 pref medium

because iproute2 lists cached routes using RTM_GETROUTE, and flushes them
by listing all the routes, and deleting them with RTM_DELROUTE one by one.

Look up exceptions in the hash table associated with the current fib6_info
in rt6_dump_route(), and, if present and not expired, add them to the
dump.

Re-allow userspace to get FIB results by passing the RTM_F_CLONED flag as
filter, by reverting commit 08e814c9e8eb ("net/ipv6: Bail early if user
only wants cloned entries").

As we do this, we also have to honour this flag while filtering routes in
rt6_dump_route() and, if this filter effectively causes some results to be
discarded, by passing the NLM_F_DUMP_FILTERED flag back.

To flush cached routes, a procfs entry could be introduced instead: that's
how it works for IPv4. We already have a rt6_flush_exception() function
ready to be wired to it. However, this would not solve the issue for
listing, and wouldn't fix the issue with current and previous versions of
iproute2.

Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
This will cause a non-trivial conflict with commit cc5c073a693f
("ipv6: Move exception bucket to fib6_nh") on net-next. I can submit
an equivalent patch against net-next, if it helps.

 net/ipv6/ip6_fib.c |  7 ++-----
 net/ipv6/route.c   | 38 +++++++++++++++++++++++++++++++++++---
 2 files changed, 37 insertions(+), 8 deletions(-)

Comments

David Ahern June 6, 2019, 8:57 p.m. UTC | #1
On 6/6/19 2:13 PM, Stefano Brivio wrote:
> Since commit 2b760fcf5cfb ("ipv6: hook up exception table to store dst
> cache"), route exceptions reside in a separate hash table, and won't be
> found by walking the FIB, so they won't be dumped to userspace on a
> RTM_GETROUTE message.
> 
> This causes 'ip -6 route list cache' and 'ip -6 route flush cache' to
> have no function anymore:
> 
>  # ip -6 route get fc00:3::1
>  fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 539sec mtu 1400 pref medium
>  # ip -6 route get fc00:4::1
>  fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 536sec mtu 1500 pref medium
>  # ip -6 route list cache
>  # ip -6 route flush cache
>  # ip -6 route get fc00:3::1
>  fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 520sec mtu 1400 pref medium
>  # ip -6 route get fc00:4::1
>  fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 519sec mtu 1500 pref medium
> 
> because iproute2 lists cached routes using RTM_GETROUTE, and flushes them
> by listing all the routes, and deleting them with RTM_DELROUTE one by one.
> 
> Look up exceptions in the hash table associated with the current fib6_info
> in rt6_dump_route(), and, if present and not expired, add them to the
> dump.
> 
> Re-allow userspace to get FIB results by passing the RTM_F_CLONED flag as
> filter, by reverting commit 08e814c9e8eb ("net/ipv6: Bail early if user
> only wants cloned entries").
> 
> As we do this, we also have to honour this flag while filtering routes in
> rt6_dump_route() and, if this filter effectively causes some results to be
> discarded, by passing the NLM_F_DUMP_FILTERED flag back.
> 
> To flush cached routes, a procfs entry could be introduced instead: that's
> how it works for IPv4. We already have a rt6_flush_exception() function
> ready to be wired to it. However, this would not solve the issue for
> listing, and wouldn't fix the issue with current and previous versions of
> iproute2.
> 
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> This will cause a non-trivial conflict with commit cc5c073a693f
> ("ipv6: Move exception bucket to fib6_nh") on net-next. I can submit
> an equivalent patch against net-next, if it helps.
> 

Thanks for doing this. It is on my to-do list.

Can you do the same for IPv4?
Stefano Brivio June 6, 2019, 9:18 p.m. UTC | #2
On Thu, 6 Jun 2019 14:57:33 -0600
David Ahern <dsahern@gmail.com> wrote:

> > This will cause a non-trivial conflict with commit cc5c073a693f
> > ("ipv6: Move exception bucket to fib6_nh") on net-next. I can submit
> > an equivalent patch against net-next, if it helps.
> >   
> 
> Thanks for doing this. It is on my to-do list.
> 
> Can you do the same for IPv4?

You mean this same fix? On IPv4, for flushing, iproute2
uses /proc/sys/net/ipv4/route/flush in iproute_flush_cache(), and that
works.

Listing doesn't work instead, for some different reason I haven't
looked into yet. That doesn't look as critical as the situation on IPv6
where one can't even flush the cache: exceptions can also be fetched
with 'ip route get', and that works.

Still, it's bad, I can look into it within a few days.
Martin KaFai Lau June 6, 2019, 9:44 p.m. UTC | #3
On Thu, Jun 06, 2019 at 10:13:41PM +0200, Stefano Brivio wrote:
> Since commit 2b760fcf5cfb ("ipv6: hook up exception table to store dst
> cache"), route exceptions reside in a separate hash table, and won't be
> found by walking the FIB, so they won't be dumped to userspace on a
> RTM_GETROUTE message.
> 
> This causes 'ip -6 route list cache' and 'ip -6 route flush cache' to
> have no function anymore:
> 
>  # ip -6 route get fc00:3::1
>  fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 539sec mtu 1400 pref medium
>  # ip -6 route get fc00:4::1
>  fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 536sec mtu 1500 pref medium
>  # ip -6 route list cache
>  # ip -6 route flush cache
>  # ip -6 route get fc00:3::1
>  fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 520sec mtu 1400 pref medium
>  # ip -6 route get fc00:4::1
>  fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 519sec mtu 1500 pref medium
> 
> because iproute2 lists cached routes using RTM_GETROUTE, and flushes them
> by listing all the routes, and deleting them with RTM_DELROUTE one by one.
> 
> Look up exceptions in the hash table associated with the current fib6_info
> in rt6_dump_route(), and, if present and not expired, add them to the
> dump.
> 
> Re-allow userspace to get FIB results by passing the RTM_F_CLONED flag as
> filter, by reverting commit 08e814c9e8eb ("net/ipv6: Bail early if user
> only wants cloned entries").
> 
> As we do this, we also have to honour this flag while filtering routes in
> rt6_dump_route() and, if this filter effectively causes some results to be
> discarded, by passing the NLM_F_DUMP_FILTERED flag back.
> 
> To flush cached routes, a procfs entry could be introduced instead: that's
> how it works for IPv4. We already have a rt6_flush_exception() function
> ready to be wired to it. However, this would not solve the issue for
> listing, and wouldn't fix the issue with current and previous versions of
> iproute2.
> 
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> This will cause a non-trivial conflict with commit cc5c073a693f
> ("ipv6: Move exception bucket to fib6_nh") on net-next. I can submit
> an equivalent patch against net-next, if it helps.
> 
>  net/ipv6/ip6_fib.c |  7 ++-----
>  net/ipv6/route.c   | 38 +++++++++++++++++++++++++++++++++++---
>  2 files changed, 37 insertions(+), 8 deletions(-)
> 
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index 008421b550c6..5be133565819 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -581,13 +581,10 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
>  	} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
>  		struct rtmsg *rtm = nlmsg_data(nlh);
>  
> -		arg.filter.flags = rtm->rtm_flags & (RTM_F_PREFIX|RTM_F_CLONED);
> +		if (rtm->rtm_flags & RTM_F_PREFIX)
> +			arg.filter.flags = RTM_F_PREFIX;
>  	}
>  
> -	/* fib entries are never clones */
> -	if (arg.filter.flags & RTM_F_CLONED)
> -		goto out;
> -
>  	w = (void *)cb->args[2];
>  	if (!w) {
>  		/* New dump:
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 848e944f07df..51f923b3ad26 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -4862,8 +4862,11 @@ int rt6_dump_route(struct fib6_info *rt, void *p_arg)
>  {
>  	struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg;
>  	struct fib_dump_filter *filter = &arg->filter;
> +	struct rt6_exception_bucket *bucket;
>  	unsigned int flags = NLM_F_MULTI;
> +	struct rt6_exception *rt6_ex;
>  	struct net *net = arg->net;
> +	int i, err;
>  
>  	if (rt == net->ipv6.fib6_null_entry)
>  		return 0;
> @@ -4882,9 +4885,38 @@ int rt6_dump_route(struct fib6_info *rt, void *p_arg)
>  		flags |= NLM_F_DUMP_FILTERED;
>  	}
>  
> -	return rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
> -			     RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid,
> -			     arg->cb->nlh->nlmsg_seq, flags);
> +	if (!(filter->flags & RTM_F_CLONED)) {
> +		err = rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
> +				    RTM_NEWROUTE,
> +				    NETLINK_CB(arg->cb->skb).portid,
> +				    arg->cb->nlh->nlmsg_seq, flags);
> +		if (err)
> +			return err;
> +	} else {
> +		flags |= NLM_F_DUMP_FILTERED;
> +	}
> +
> +	bucket = rcu_dereference(rt->rt6i_exception_bucket);
> +	if (!bucket)
> +		return 0;
> +
> +	for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
> +		hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) {
> +			if (rt6_check_expired(rt6_ex->rt6i))
> +				continue;
> +
> +			err = rt6_fill_node(net, arg->skb, rt,
> +					    &rt6_ex->rt6i->dst,
> +					    NULL, NULL, 0, RTM_NEWROUTE,
> +					    NETLINK_CB(arg->cb->skb).portid,
> +					    arg->cb->nlh->nlmsg_seq, flags);
Thanks for the patch.

A question on when rt6_fill_node() returns -EMSGSIZE while dumping the
exception bucket here.  Where will the next inet6_dump_fib() start?

> +			if (err)
> +				return err;
> +		}
> +		bucket++;
> +	}
> +
> +	return 0;
>  }
>  
>  static int inet6_rtm_valid_getroute_req(struct sk_buff *skb,
> -- 
> 2.20.1
>
Stefano Brivio June 6, 2019, 10:17 p.m. UTC | #4
On Thu, 6 Jun 2019 21:44:58 +0000
Martin Lau <kafai@fb.com> wrote:

> > +	if (!(filter->flags & RTM_F_CLONED)) {
> > +		err = rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
> > +				    RTM_NEWROUTE,
> > +				    NETLINK_CB(arg->cb->skb).portid,
> > +				    arg->cb->nlh->nlmsg_seq, flags);
> > +		if (err)
> > +			return err;
> > +	} else {
> > +		flags |= NLM_F_DUMP_FILTERED;
> > +	}
> > +
> > +	bucket = rcu_dereference(rt->rt6i_exception_bucket);
> > +	if (!bucket)
> > +		return 0;
> > +
> > +	for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
> > +		hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) {
> > +			if (rt6_check_expired(rt6_ex->rt6i))
> > +				continue;
> > +
> > +			err = rt6_fill_node(net, arg->skb, rt,
> > +					    &rt6_ex->rt6i->dst,
> > +					    NULL, NULL, 0, RTM_NEWROUTE,
> > +					    NETLINK_CB(arg->cb->skb).portid,
> > +					    arg->cb->nlh->nlmsg_seq, flags);  
> Thanks for the patch.
> 
> A question on when rt6_fill_node() returns -EMSGSIZE while dumping the
> exception bucket here.  Where will the next inet6_dump_fib() start?

And thanks for reviewing.

It starts again from the same node, see fib6_dump_node(): w->leaf = rt;
where rt is the fib6_info where we failed dumping, so we won't skip
dumping any node.

This also means that to avoid sending duplicates in the case where at
least one rt6_fill_node() call goes through and one fails, we would
need to track the last bucket and entry sent, or, alternatively, to
make sure we can fit the whole node before dumping.

I don't think that can happen in practice, or at least I haven't found a
way to create enough valid exceptions for the same node.

Anyway, I guess that would be nicer, but the fix is going to be much
bigger, and I don't think we even have to guarantee that. I'd rather
take care of that as a follow-up. Any preferred solution by the way?
Martin KaFai Lau June 6, 2019, 10:37 p.m. UTC | #5
On Fri, Jun 07, 2019 at 12:17:47AM +0200, Stefano Brivio wrote:
> On Thu, 6 Jun 2019 21:44:58 +0000
> Martin Lau <kafai@fb.com> wrote:
> 
> > > +	if (!(filter->flags & RTM_F_CLONED)) {
> > > +		err = rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
> > > +				    RTM_NEWROUTE,
> > > +				    NETLINK_CB(arg->cb->skb).portid,
> > > +				    arg->cb->nlh->nlmsg_seq, flags);
> > > +		if (err)
> > > +			return err;
> > > +	} else {
> > > +		flags |= NLM_F_DUMP_FILTERED;
> > > +	}
> > > +
> > > +	bucket = rcu_dereference(rt->rt6i_exception_bucket);
> > > +	if (!bucket)
> > > +		return 0;
> > > +
> > > +	for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
> > > +		hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) {
> > > +			if (rt6_check_expired(rt6_ex->rt6i))
> > > +				continue;
> > > +
> > > +			err = rt6_fill_node(net, arg->skb, rt,
> > > +					    &rt6_ex->rt6i->dst,
> > > +					    NULL, NULL, 0, RTM_NEWROUTE,
> > > +					    NETLINK_CB(arg->cb->skb).portid,
> > > +					    arg->cb->nlh->nlmsg_seq, flags);  
> > Thanks for the patch.
> > 
> > A question on when rt6_fill_node() returns -EMSGSIZE while dumping the
> > exception bucket here.  Where will the next inet6_dump_fib() start?
> 
> And thanks for reviewing.
> 
> It starts again from the same node, see fib6_dump_node(): w->leaf = rt;
> where rt is the fib6_info where we failed dumping, so we won't skip
> dumping any node.
If the same node will be dumped, does it mean that it will go through this
loop and iterate all exceptions again?

> 
> This also means that to avoid sending duplicates in the case where at
> least one rt6_fill_node() call goes through and one fails, we would
> need to track the last bucket and entry sent, or, alternatively, to
> make sure we can fit the whole node before dumping.
My another concern is the dump may never finish.

> 
> I don't think that can happen in practice, or at least I haven't found a
> way to create enough valid exceptions for the same node.
That I am not sure.  It is not unusual to have many pmtu exceptions in
a gateway node.

> 
> Anyway, I guess that would be nicer, but the fix is going to be much
> bigger, and I don't think we even have to guarantee that. I'd rather
> take care of that as a follow-up. Any preferred solution by the way?
> 
> -- 
> Stefano
David Ahern June 6, 2019, 10:47 p.m. UTC | #6
On 6/6/19 3:18 PM, Stefano Brivio wrote:
> On Thu, 6 Jun 2019 14:57:33 -0600
> David Ahern <dsahern@gmail.com> wrote:
> 
>>> This will cause a non-trivial conflict with commit cc5c073a693f
>>> ("ipv6: Move exception bucket to fib6_nh") on net-next. I can submit
>>> an equivalent patch against net-next, if it helps.
>>>   
>>
>> Thanks for doing this. It is on my to-do list.
>>
>> Can you do the same for IPv4?
> 
> You mean this same fix? On IPv4, for flushing, iproute2
> uses /proc/sys/net/ipv4/route/flush in iproute_flush_cache(), and that
> works.
> 
> Listing doesn't work instead, for some different reason I haven't
> looked into yet. That doesn't look as critical as the situation on IPv6
> where one can't even flush the cache: exceptions can also be fetched
> with 'ip route get', and that works.
> 
> Still, it's bad, I can look into it within a few days.
> 

I meant the ability to dump the exception cache.

Currently, we do not get the exceptions in a fib dump. There is a flag
to only show cloned (cached) entries, but no way to say 'no cloned
entries'. Maybe these should only be dumped if the cloned flag is set.
That's the use case I was targeting:
1. fib dumps - RTM_F_CLONED not set
2. exception dump - RTM_F_CLONED set
David Ahern June 6, 2019, 10:48 p.m. UTC | #7
On 6/6/19 4:37 PM, Martin Lau wrote:
>> I don't think that can happen in practice, or at least I haven't found a
>> way to create enough valid exceptions for the same node.
> That I am not sure.  It is not unusual to have many pmtu exceptions in
> a gateway node.
> 

yes.

Stefano: you could generalize this test script
   http://patchwork.ozlabs.org/patch/1110802/
to have N-remote hosts
Stefano Brivio June 6, 2019, 10:58 p.m. UTC | #8
On Thu, 6 Jun 2019 22:37:11 +0000
Martin Lau <kafai@fb.com> wrote:

> On Fri, Jun 07, 2019 at 12:17:47AM +0200, Stefano Brivio wrote:
> > On Thu, 6 Jun 2019 21:44:58 +0000
> > Martin Lau <kafai@fb.com> wrote:
> >   
> > > > +	if (!(filter->flags & RTM_F_CLONED)) {
> > > > +		err = rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
> > > > +				    RTM_NEWROUTE,
> > > > +				    NETLINK_CB(arg->cb->skb).portid,
> > > > +				    arg->cb->nlh->nlmsg_seq, flags);
> > > > +		if (err)
> > > > +			return err;
> > > > +	} else {
> > > > +		flags |= NLM_F_DUMP_FILTERED;
> > > > +	}
> > > > +
> > > > +	bucket = rcu_dereference(rt->rt6i_exception_bucket);
> > > > +	if (!bucket)
> > > > +		return 0;
> > > > +
> > > > +	for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
> > > > +		hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) {
> > > > +			if (rt6_check_expired(rt6_ex->rt6i))
> > > > +				continue;
> > > > +
> > > > +			err = rt6_fill_node(net, arg->skb, rt,
> > > > +					    &rt6_ex->rt6i->dst,
> > > > +					    NULL, NULL, 0, RTM_NEWROUTE,
> > > > +					    NETLINK_CB(arg->cb->skb).portid,
> > > > +					    arg->cb->nlh->nlmsg_seq, flags);    
> > > Thanks for the patch.
> > > 
> > > A question on when rt6_fill_node() returns -EMSGSIZE while dumping the
> > > exception bucket here.  Where will the next inet6_dump_fib() start?  
> > 
> > And thanks for reviewing.
> > 
> > It starts again from the same node, see fib6_dump_node(): w->leaf = rt;
> > where rt is the fib6_info where we failed dumping, so we won't skip
> > dumping any node.  
> If the same node will be dumped, does it mean that it will go through this
> loop and iterate all exceptions again?

Yes (well, all the exceptions for that node).

> > This also means that to avoid sending duplicates in the case where at
> > least one rt6_fill_node() call goes through and one fails, we would
> > need to track the last bucket and entry sent, or, alternatively, to
> > make sure we can fit the whole node before dumping.  
> My another concern is the dump may never finish.

That's not a guarantee in general, even without this, because in theory
the skb passed might be small enough that we can't even fit a single
node without exceptions.

We could add a guard on w->leaf not being the same before and after the
walk in inet6_dump_fib() and, if it is, terminate the dump. I just
wonder if we have to do this at all -- I can't find this being done
anywhere else (at a quick look at least).

By the way, we can also trigger a never-ending dump by touching the
tree frequently enough during a dump: it would always start again from
the root, see fib6_dump_table().
Stefano Brivio June 6, 2019, 11:07 p.m. UTC | #9
On Thu, 6 Jun 2019 16:47:00 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 6/6/19 3:18 PM, Stefano Brivio wrote:
> > On Thu, 6 Jun 2019 14:57:33 -0600
> > David Ahern <dsahern@gmail.com> wrote:
> >   
> >>> This will cause a non-trivial conflict with commit cc5c073a693f
> >>> ("ipv6: Move exception bucket to fib6_nh") on net-next. I can submit
> >>> an equivalent patch against net-next, if it helps.
> >>>     
> >>
> >> Thanks for doing this. It is on my to-do list.
> >>
> >> Can you do the same for IPv4?  
> > 
> > You mean this same fix? On IPv4, for flushing, iproute2
> > uses /proc/sys/net/ipv4/route/flush in iproute_flush_cache(), and that
> > works.
> > 
> > Listing doesn't work instead, for some different reason I haven't
> > looked into yet. That doesn't look as critical as the situation on IPv6
> > where one can't even flush the cache: exceptions can also be fetched
> > with 'ip route get', and that works.
> > 
> > Still, it's bad, I can look into it within a few days.
> >   
> 
> I meant the ability to dump the exception cache.
> 
> Currently, we do not get the exceptions in a fib dump. There is a flag
> to only show cloned (cached) entries, but no way to say 'no cloned
> entries'. Maybe these should only be dumped if the cloned flag is set.
> That's the use case I was targeting:
> 1. fib dumps - RTM_F_CLONED not set
> 2. exception dump - RTM_F_CLONED set

I think it would make a lot of sense. But don't we risk breaking
userspace even further, by skipping exceptions if RTM_F_CLONED is not
set? On the other hand, this was broken for almost two years, maybe
it's not too bad after all.
Stefano Brivio June 6, 2019, 11:15 p.m. UTC | #10
On Fri, 7 Jun 2019 00:58:52 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Thu, 6 Jun 2019 22:37:11 +0000
> Martin Lau <kafai@fb.com> wrote:
> 
> > On Fri, Jun 07, 2019 at 12:17:47AM +0200, Stefano Brivio wrote:  
> > > On Thu, 6 Jun 2019 21:44:58 +0000
> > > Martin Lau <kafai@fb.com> wrote:
> > >     
> > > > > +	if (!(filter->flags & RTM_F_CLONED)) {
> > > > > +		err = rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
> > > > > +				    RTM_NEWROUTE,
> > > > > +				    NETLINK_CB(arg->cb->skb).portid,
> > > > > +				    arg->cb->nlh->nlmsg_seq, flags);
> > > > > +		if (err)
> > > > > +			return err;
> > > > > +	} else {
> > > > > +		flags |= NLM_F_DUMP_FILTERED;
> > > > > +	}
> > > > > +
> > > > > +	bucket = rcu_dereference(rt->rt6i_exception_bucket);
> > > > > +	if (!bucket)
> > > > > +		return 0;
> > > > > +
> > > > > +	for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
> > > > > +		hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) {
> > > > > +			if (rt6_check_expired(rt6_ex->rt6i))
> > > > > +				continue;
> > > > > +
> > > > > +			err = rt6_fill_node(net, arg->skb, rt,
> > > > > +					    &rt6_ex->rt6i->dst,
> > > > > +					    NULL, NULL, 0, RTM_NEWROUTE,
> > > > > +					    NETLINK_CB(arg->cb->skb).portid,
> > > > > +					    arg->cb->nlh->nlmsg_seq, flags);      
> > > > Thanks for the patch.
> > > > 
> > > > A question on when rt6_fill_node() returns -EMSGSIZE while dumping the
> > > > exception bucket here.  Where will the next inet6_dump_fib() start?    
> > > 
> > > And thanks for reviewing.
> > > 
> > > It starts again from the same node, see fib6_dump_node(): w->leaf = rt;
> > > where rt is the fib6_info where we failed dumping, so we won't skip
> > > dumping any node.    
> > If the same node will be dumped, does it mean that it will go through this
> > loop and iterate all exceptions again?  
> 
> Yes (well, all the exceptions for that node).
> 
> > > This also means that to avoid sending duplicates in the case where at
> > > least one rt6_fill_node() call goes through and one fails, we would
> > > need to track the last bucket and entry sent, or, alternatively, to
> > > make sure we can fit the whole node before dumping.    
> > My another concern is the dump may never finish.  
> 
> That's not a guarantee in general, even without this, because in theory
> the skb passed might be small enough that we can't even fit a single
> node without exceptions.
> 
> We could add a guard on w->leaf not being the same before and after the
> walk in inet6_dump_fib() and, if it is, terminate the dump. I just
> wonder if we have to do this at all -- I can't find this being done
> anywhere else (at a quick look at least).

I still can't convince myself this is an actual issue, but... somewhat
simpler: let's add a field to fib6_walker, that counts the entries
(both from FIB and exceptions) already dumped for the current node:

		res = rt6_dump_route(rt, w->args);
		if (res) {
			/* Frame is full, suspend walking */
			w->leaf = rt;
			w->skip_node = res;
			return 1;
		}

if the current leaf changes (tree changed), we reset it. And we use that
to skip rt6_fill_node() calls in rt6_dump_route(). What do you think?
David Ahern June 6, 2019, 11:19 p.m. UTC | #11
On 6/6/19 4:58 PM, Stefano Brivio wrote:
>>> This also means that to avoid sending duplicates in the case where at
>>> least one rt6_fill_node() call goes through and one fails, we would
>>> need to track the last bucket and entry sent, or, alternatively, to
>>> make sure we can fit the whole node before dumping.  
>> My another concern is the dump may never finish.
> 
> That's not a guarantee in general, even without this, because in theory
> the skb passed might be small enough that we can't even fit a single
> node without exceptions.

That should be handled by skb->len = 0 and then returning the err back
to caller. See inet_dump_fib.

> 
> We could add a guard on w->leaf not being the same before and after the
> walk in inet6_dump_fib() and, if it is, terminate the dump. I just
> wonder if we have to do this at all -- I can't find this being done
> anywhere else (at a quick look at least).
> 
> By the way, we can also trigger a never-ending dump by touching the
> tree frequently enough during a dump: it would always start again from
> the root, see fib6_dump_table().
> 

that should be a userspace problem on a sequence mismatch. In-kernel
notifiers do restart for offload drivers (seeregister_fib_notifier), but
as I recall the kernel should not restart a dump.

libnl reference is nl_cache_refill,

                err = nl_cache_pickup(sk, cache);
                if (err == -NLE_DUMP_INTR) {
                        NL_DBG(2, "Dump interrupted, restarting!\n");
                        goto restart;
                } else if (err < 0)
                        break;
Martin KaFai Lau June 6, 2019, 11:31 p.m. UTC | #12
On Fri, Jun 07, 2019 at 12:58:52AM +0200, Stefano Brivio wrote:
> On Thu, 6 Jun 2019 22:37:11 +0000
> Martin Lau <kafai@fb.com> wrote:
> 
> > On Fri, Jun 07, 2019 at 12:17:47AM +0200, Stefano Brivio wrote:
> > > On Thu, 6 Jun 2019 21:44:58 +0000
> > > Martin Lau <kafai@fb.com> wrote:
> > >   
> > > > > +	if (!(filter->flags & RTM_F_CLONED)) {
> > > > > +		err = rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
> > > > > +				    RTM_NEWROUTE,
> > > > > +				    NETLINK_CB(arg->cb->skb).portid,
> > > > > +				    arg->cb->nlh->nlmsg_seq, flags);
> > > > > +		if (err)
> > > > > +			return err;
> > > > > +	} else {
> > > > > +		flags |= NLM_F_DUMP_FILTERED;
> > > > > +	}
> > > > > +
> > > > > +	bucket = rcu_dereference(rt->rt6i_exception_bucket);
> > > > > +	if (!bucket)
> > > > > +		return 0;
> > > > > +
> > > > > +	for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
> > > > > +		hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) {
> > > > > +			if (rt6_check_expired(rt6_ex->rt6i))
> > > > > +				continue;
> > > > > +
> > > > > +			err = rt6_fill_node(net, arg->skb, rt,
> > > > > +					    &rt6_ex->rt6i->dst,
> > > > > +					    NULL, NULL, 0, RTM_NEWROUTE,
> > > > > +					    NETLINK_CB(arg->cb->skb).portid,
> > > > > +					    arg->cb->nlh->nlmsg_seq, flags);    
> > > > Thanks for the patch.
> > > > 
> > > > A question on when rt6_fill_node() returns -EMSGSIZE while dumping the
> > > > exception bucket here.  Where will the next inet6_dump_fib() start?  
> > > 
> > > And thanks for reviewing.
> > > 
> > > It starts again from the same node, see fib6_dump_node(): w->leaf = rt;
> > > where rt is the fib6_info where we failed dumping, so we won't skip
> > > dumping any node.  
> > If the same node will be dumped, does it mean that it will go through this
> > loop and iterate all exceptions again?
> 
> Yes (well, all the exceptions for that node).
> 
> > > This also means that to avoid sending duplicates in the case where at
> > > least one rt6_fill_node() call goes through and one fails, we would
> > > need to track the last bucket and entry sent, or, alternatively, to
> > > make sure we can fit the whole node before dumping.  
> > My another concern is the dump may never finish.
> 
> That's not a guarantee in general, even without this, because in theory
> the skb passed might be small enough that we can't even fit a single
> node without exceptions.
That is arguably the caller's responsibility to retry
with a larger buffer if it cannot even get a single route.

If caller provides a large enough buffer for a single route,
the kernel should guarantee forward progress.

I think the minimum is to remember how many exceptions have to be
skipped.

> 
> We could add a guard on w->leaf not being the same before and after the
> walk in inet6_dump_fib() and, if it is, terminate the dump. I just
> wonder if we have to do this at all -- I can't find this being done
> anywhere else (at a quick look at least).
> 
> By the way, we can also trigger a never-ending dump by touching the
> tree frequently enough during a dump: it would always start again from
> the root, see fib6_dump_table().
This case "cb->args[5] != w->root->fn_sernum"?  It seems there is a w->skip
to take care of it.

Regardless, I don't think we should make it worse.
Stefano Brivio June 7, 2019, 1:54 a.m. UTC | #13
On Thu, 6 Jun 2019 16:48:51 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 6/6/19 4:37 PM, Martin Lau wrote:
> >> I don't think that can happen in practice, or at least I haven't found a
> >> way to create enough valid exceptions for the same node.  
> > That I am not sure.  It is not unusual to have many pmtu exceptions in
> > a gateway node.
> >   
> 
> yes.
> 
> Stefano: you could generalize this test script
>    http://patchwork.ozlabs.org/patch/1110802/
> to have N-remote hosts

Right, thanks for the pointer. I ended up doing something like that in
pmtu.sh, and it turns out that, starting from 25 exceptions in the same
node, iproute2 doesn't actually retry with a larger buffer. As Martin
predicted (thanks!) the dump doesn't terminate.

I tested a version that counts the number of routes in a partial dump
and skips them on the next one with 10,000 entries, dump terminates and
entries count is consistent (at some point, the buckets are just full,
and number of entries doesn't increase any longer).

Unfortunately, the setup of the test takes a few minutes, so I wouldn't
include it (at least as it is) in the selftest.

I'll post that as v2 soon.
Martin KaFai Lau June 8, 2019, 5:40 a.m. UTC | #14
On Thu, Jun 06, 2019 at 04:47:00PM -0600, David Ahern wrote:
> On 6/6/19 3:18 PM, Stefano Brivio wrote:
> > On Thu, 6 Jun 2019 14:57:33 -0600
> > David Ahern <dsahern@gmail.com> wrote:
> > 
> >>> This will cause a non-trivial conflict with commit cc5c073a693f
> >>> ("ipv6: Move exception bucket to fib6_nh") on net-next. I can submit
> >>> an equivalent patch against net-next, if it helps.
> >>>   
> >>
> >> Thanks for doing this. It is on my to-do list.
> >>
> >> Can you do the same for IPv4?
> > 
> > You mean this same fix? On IPv4, for flushing, iproute2
> > uses /proc/sys/net/ipv4/route/flush in iproute_flush_cache(), and that
> > works.
> > 
> > Listing doesn't work instead, for some different reason I haven't
> > looked into yet. That doesn't look as critical as the situation on IPv6
> > where one can't even flush the cache: exceptions can also be fetched
> > with 'ip route get', and that works.
> > 
> > Still, it's bad, I can look into it within a few days.
> > 
> 
> I meant the ability to dump the exception cache.
> 
> Currently, we do not get the exceptions in a fib dump. There is a flag
> to only show cloned (cached) entries, but no way to say 'no cloned
> entries'. Maybe these should only be dumped if the cloned flag is set.
> That's the use case I was targeting:
> 1. fib dumps - RTM_F_CLONED not set
I also think the fib dump should stay as is.

To be clear, I do not expect exception routes output from the
'ip [-6] r l'.  Otherwise, I will get pages of exceptions
that I am not interested at.  This should apply for both
v4 and v6.

> 2. exception dump - RTM_F_CLONED set
Stefano Brivio June 8, 2019, 5:59 a.m. UTC | #15
On Sat, 8 Jun 2019 05:40:06 +0000
Martin Lau <kafai@fb.com> wrote:

> On Thu, Jun 06, 2019 at 04:47:00PM -0600, David Ahern wrote:
> > On 6/6/19 3:18 PM, Stefano Brivio wrote:  
> > > On Thu, 6 Jun 2019 14:57:33 -0600
> > > David Ahern <dsahern@gmail.com> wrote:
> > >   
> > >>> This will cause a non-trivial conflict with commit cc5c073a693f
> > >>> ("ipv6: Move exception bucket to fib6_nh") on net-next. I can submit
> > >>> an equivalent patch against net-next, if it helps.
> > >>>     
> > >>
> > >> Thanks for doing this. It is on my to-do list.
> > >>
> > >> Can you do the same for IPv4?  
> > > 
> > > You mean this same fix? On IPv4, for flushing, iproute2
> > > uses /proc/sys/net/ipv4/route/flush in iproute_flush_cache(), and that
> > > works.
> > > 
> > > Listing doesn't work instead, for some different reason I haven't
> > > looked into yet. That doesn't look as critical as the situation on IPv6
> > > where one can't even flush the cache: exceptions can also be fetched
> > > with 'ip route get', and that works.
> > > 
> > > Still, it's bad, I can look into it within a few days.
> > >   
> > 
> > I meant the ability to dump the exception cache.
> > 
> > Currently, we do not get the exceptions in a fib dump. There is a flag
> > to only show cloned (cached) entries, but no way to say 'no cloned
> > entries'. Maybe these should only be dumped if the cloned flag is set.
> > That's the use case I was targeting:
> > 1. fib dumps - RTM_F_CLONED not set  
> I also think the fib dump should stay as is.
> 
> To be clear, I do not expect exception routes output from the
> 'ip [-6] r l'.  Otherwise, I will get pages of exceptions
> that I am not interested at.  This should apply for both
> v4 and v6.

I also agree it makes more sense to filter routes this way.

But it wasn't like this before 2b760fcf5cfb, so this smells like
breaking userspace expectations, even though iproute already filters
routes this way: with 'cache' it only displays routes with
RTM_F_CLONED, without, it won't display exceptions, see filter_nlmsg():

	if (filter.cloned == !(r->rtm_flags & RTM_F_CLONED))
		return 0;

This, together with the fact it's been like that for almost two years
now, makes it acceptable in my opinion. What do you think?

If we agree on this, I'll go ahead and start changing this in my patch
for IPv6.
Martin KaFai Lau June 8, 2019, 7:19 a.m. UTC | #16
On Sat, Jun 08, 2019 at 07:59:11AM +0200, Stefano Brivio wrote:
> On Sat, 8 Jun 2019 05:40:06 +0000
> Martin Lau <kafai@fb.com> wrote:
> 
> > On Thu, Jun 06, 2019 at 04:47:00PM -0600, David Ahern wrote:
> > > On 6/6/19 3:18 PM, Stefano Brivio wrote:  
> > > > On Thu, 6 Jun 2019 14:57:33 -0600
> > > > David Ahern <dsahern@gmail.com> wrote:
> > > >   
> > > >>> This will cause a non-trivial conflict with commit cc5c073a693f
> > > >>> ("ipv6: Move exception bucket to fib6_nh") on net-next. I can submit
> > > >>> an equivalent patch against net-next, if it helps.
> > > >>>     
> > > >>
> > > >> Thanks for doing this. It is on my to-do list.
> > > >>
> > > >> Can you do the same for IPv4?  
> > > > 
> > > > You mean this same fix? On IPv4, for flushing, iproute2
> > > > uses /proc/sys/net/ipv4/route/flush in iproute_flush_cache(), and that
> > > > works.
> > > > 
> > > > Listing doesn't work instead, for some different reason I haven't
> > > > looked into yet. That doesn't look as critical as the situation on IPv6
> > > > where one can't even flush the cache: exceptions can also be fetched
> > > > with 'ip route get', and that works.
> > > > 
> > > > Still, it's bad, I can look into it within a few days.
> > > >   
> > > 
> > > I meant the ability to dump the exception cache.
> > > 
> > > Currently, we do not get the exceptions in a fib dump. There is a flag
> > > to only show cloned (cached) entries, but no way to say 'no cloned
> > > entries'. Maybe these should only be dumped if the cloned flag is set.
> > > That's the use case I was targeting:
> > > 1. fib dumps - RTM_F_CLONED not set  
> > I also think the fib dump should stay as is.
> > 
> > To be clear, I do not expect exception routes output from the
> > 'ip [-6] r l'.  Otherwise, I will get pages of exceptions
> > that I am not interested at.  This should apply for both
> > v4 and v6.
> 
> I also agree it makes more sense to filter routes this way.
> 
> But it wasn't like this before 2b760fcf5cfb, so this smells like
> breaking userspace expectations, even though iproute already filters
> routes this way: with 'cache' it only displays routes with
> RTM_F_CLONED, without, it won't display exceptions, see filter_nlmsg():
Thanks for pointing it out.

> 
> 	if (filter.cloned == !(r->rtm_flags & RTM_F_CLONED))
> 		return 0;
> 
> This, together with the fact it's been like that for almost two years
> now, makes it acceptable in my opinion. What do you think?
With learning the above fact on iproute2,
it makes even less sense to dump exceptions from the kernel side
when RTM_F_CLONED is not set.

> If we agree on this, I'll go ahead and start changing this in my patch
> for IPv6.
Stefano Brivio June 8, 2019, 3:02 p.m. UTC | #17
On Sat, 8 Jun 2019 07:19:23 +0000
Martin Lau <kafai@fb.com> wrote:

> On Sat, Jun 08, 2019 at 07:59:11AM +0200, Stefano Brivio wrote:
> > I also agree it makes more sense to filter routes this way.
> > 
> > But it wasn't like this before 2b760fcf5cfb, so this smells like
> > breaking userspace expectations, even though iproute already filters
> > routes this way: with 'cache' it only displays routes with
> > RTM_F_CLONED, without, it won't display exceptions, see filter_nlmsg():  
> Thanks for pointing it out.
> 
> > 	if (filter.cloned == !(r->rtm_flags & RTM_F_CLONED))
> > 		return 0;
> > 
> > This, together with the fact it's been like that for almost two years
> > now, makes it acceptable in my opinion. What do you think?  
> With learning the above fact on iproute2,
> it makes even less sense to dump exceptions from the kernel side
> when RTM_F_CLONED is not set.

I just hit a more fundamental problem though: iproute2 filters on the
flag, but never sets it on a dump request. Flags will be NLM_F_DUMP |
NLM_F_REQUEST, no matter what, see rtnl_routedump_req(). So the current
iproute2 would have no way to dump cached routes.

It could from 2007, iproute2 9ab4c85b9af1 ("Fix bug in display of ipv6
cloned/cached routes"), to 2017, kernel 2b760fcf5cfb ("ipv6: hook up
exception table to store dst cache").

Something tells me it's wrong to fix userspace, because userspace "is
always right". There has been by the way a similar discussion on this
list in 2011, see https://lists.openwall.net/netdev/2011/12/28/27.

I would proceed like this:

- stick to the original semantic of RTM_F_CLONED and fix the issue at
  hand, which would be v2 with your suggested clean-up and without
  check on RTM_F_CLONED. Exceptions are always dumped and iproute2 will
  filter them as it always did. Result: kernel sends exceptions on
  netlink even if not "requested" but iproute2 works again and won't
  spam you anyway, and the issue is fixed for the users

- fix this on IPv4 (as I mentioned, I think it's less critical, because
  at least flushing works, and listing with 'route get' is awkward but
  possible)

- retry adding NLM_F_MATCH (for net-next and iproute-next) according
  to RFC 3549. Things changed a bit from 2011: we now have
  NLM_F_DUMP_FILTERED, iproute2 already uses it (ip neigh) and we
  wouldn't need to make iproute2 more complicated by handling old/new
  kernel cases. So I think this would be reasonable now.
Stefano Brivio June 8, 2019, 3:47 p.m. UTC | #18
On Sat, 8 Jun 2019 17:02:06 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Sat, 8 Jun 2019 07:19:23 +0000
> Martin Lau <kafai@fb.com> wrote:
> 
> > On Sat, Jun 08, 2019 at 07:59:11AM +0200, Stefano Brivio wrote:  
> > > I also agree it makes more sense to filter routes this way.
> > > 
> > > But it wasn't like this before 2b760fcf5cfb, so this smells like
> > > breaking userspace expectations, even though iproute already filters
> > > routes this way: with 'cache' it only displays routes with
> > > RTM_F_CLONED, without, it won't display exceptions, see filter_nlmsg():    
> > Thanks for pointing it out.
> >   
> > > 	if (filter.cloned == !(r->rtm_flags & RTM_F_CLONED))
> > > 		return 0;
> > > 
> > > This, together with the fact it's been like that for almost two years
> > > now, makes it acceptable in my opinion. What do you think?    
> > With learning the above fact on iproute2,
> > it makes even less sense to dump exceptions from the kernel side
> > when RTM_F_CLONED is not set.  
> 
> I just hit a more fundamental problem though: iproute2 filters on the
> flag, but never sets it on a dump request. Flags will be NLM_F_DUMP |
> NLM_F_REQUEST, no matter what, see rtnl_routedump_req(). So the current
> iproute2 would have no way to dump cached routes.

Partially wrong: it actually sets it on 'list':

	if (rtnl_routedump_req(&rth, dump_family, iproute_dump_filter) < 0) {

[...]
static int iproute_dump_filter(struct nlmsghdr *nlh, int reqlen)
[...]
	if (filter.cloned)
		rtm->rtm_flags |= RTM_F_CLONED;

but not on 'flush':

		if (rtnl_routedump_req(&rth, family, NULL) < 0) {

but this doesn't change things much: it still has no way to flush the
cache, because the dump to get the routes to flush doesn't contain the
exceptions.

So I would stick to my latest plan.
Matti Vaittinen June 10, 2019, 5:56 a.m. UTC | #19
Hi Dee Ho Peeps!

Wow Stefano, you seem to be quite a detective :) How on earth did you
match my new email to this sole netdev intrusion done back at the 2011
%) Impressive!

On Sat, 2019-06-08 at 17:02 +0200, Stefano Brivio wrote:

> 
> - retry adding NLM_F_MATCH (for net-next and iproute-next) according
>   to RFC 3549. Things changed a bit from 2011: we now have
>   NLM_F_DUMP_FILTERED, iproute2 already uses it (ip neigh) and we
>   wouldn't need to make iproute2 more complicated by handling old/new
>   kernel cases. So I think this would be reasonable now.
> 
I am pretty sure the iproute would not have become more complicated
back in 2011 even if we did push this change back then. iproute2 could
have chosen to stick with own userspace filtering - supporting the
NLM_F_MATCH flag back then would not have broken that. And if we did it
back then - there now probably was some other tools utilizing the
kernel filtering - and today the iproute2 could pretty safely drop the
user-space route filtering code and transition to do filtering already
in kernel. Well, that's a bit late to say today :)

But yes, this unfinished thing has indeed haunted me during some black
nights =) I would be delighted to see the proper NLM_F_MATCH support in
kernel.

What stopped me back in the 2011 was actually Dave's comment that even
if he could consider applying this change he would require it for IPv4
too. And that makes perfect sense. It was just too much for me back
then. I guess this has not changed - IPv6 and IPv4 should still handle
these flags in a same way.

Br,
	Matti Vaittinen
Stefano Brivio June 10, 2019, 7:01 p.m. UTC | #20
Hi Matti,

On Mon, 10 Jun 2019 05:56:03 +0000
"Vaittinen, Matti" <Matti.Vaittinen@fi.rohmeurope.com> wrote:

> Hi Dee Ho Peeps!
> 
> Wow Stefano, you seem to be quite a detective :) How on earth did you
> match my new email to this sole netdev intrusion done back at the 2011
> %) Impressive!

I was looking for lost UDP jokes. :)

> On Sat, 2019-06-08 at 17:02 +0200, Stefano Brivio wrote:
> 
> > 
> > - retry adding NLM_F_MATCH (for net-next and iproute-next) according
> >   to RFC 3549. Things changed a bit from 2011: we now have
> >   NLM_F_DUMP_FILTERED, iproute2 already uses it (ip neigh) and we
> >   wouldn't need to make iproute2 more complicated by handling old/new
> >   kernel cases. So I think this would be reasonable now.
> >   
> I am pretty sure the iproute would not have become more complicated
> back in 2011 even if we did push this change back then. iproute2 could
> have chosen to stick with own userspace filtering - supporting the
> NLM_F_MATCH flag back then would not have broken that. And if we did it
> back then - there now probably was some other tools utilizing the
> kernel filtering - and today the iproute2 could pretty safely drop the
> user-space route filtering code and transition to do filtering already
> in kernel. Well, that's a bit late to say today :)

Well, yes, it wouldn't have become more complicated. The difference
today is that, with NLM_F_DUMP_FILTERED codifying this, we could safely
and simply skip userspace filtering with just two lines, something like:

	if (n->nlmsg_flags & RTM_F_DUMP_FILTERED)
		return 1;

in filter_nlmsg().

> But yes, this unfinished thing has indeed haunted me during some black
> nights =) I would be delighted to see the proper NLM_F_MATCH support in
> kernel.
> 
> What stopped me back in the 2011 was actually Dave's comment that even
> if he could consider applying this change he would require it for IPv4
> too. And that makes perfect sense. It was just too much for me back
> then. I guess this has not changed - IPv6 and IPv4 should still handle
> these flags in a same way.

Indeed, we'll have to take care of both. It's on my to do list now, I
should get to it soon.

Right now my priority is to fix the fact that, at least with iproute2,
flushing IPv6 routed caches is simply not possible anymore. This looks
like a serious breakage to me.
Martin KaFai Lau June 10, 2019, 7:42 p.m. UTC | #21
On Sat, Jun 08, 2019 at 05:47:07PM +0200, Stefano Brivio wrote:
> On Sat, 8 Jun 2019 17:02:06 +0200
> Stefano Brivio <sbrivio@redhat.com> wrote:
> 
> > On Sat, 8 Jun 2019 07:19:23 +0000
> > Martin Lau <kafai@fb.com> wrote:
> > 
> > > On Sat, Jun 08, 2019 at 07:59:11AM +0200, Stefano Brivio wrote:  
> > > > I also agree it makes more sense to filter routes this way.
> > > > 
> > > > But it wasn't like this before 2b760fcf5cfb, so this smells like
> > > > breaking userspace expectations, even though iproute already filters
> > > > routes this way: with 'cache' it only displays routes with
> > > > RTM_F_CLONED, without, it won't display exceptions, see filter_nlmsg():    
> > > Thanks for pointing it out.
> > >   
> > > > 	if (filter.cloned == !(r->rtm_flags & RTM_F_CLONED))
> > > > 		return 0;
> > > > 
> > > > This, together with the fact it's been like that for almost two years
> > > > now, makes it acceptable in my opinion. What do you think?    
> > > With learning the above fact on iproute2,
> > > it makes even less sense to dump exceptions from the kernel side
> > > when RTM_F_CLONED is not set.  
> > 
> > I just hit a more fundamental problem though: iproute2 filters on the
> > flag, but never sets it on a dump request. Flags will be NLM_F_DUMP |
> > NLM_F_REQUEST, no matter what, see rtnl_routedump_req(). So the current
> > iproute2 would have no way to dump cached routes.
> 
> Partially wrong: it actually sets it on 'list':
> 
> 	if (rtnl_routedump_req(&rth, dump_family, iproute_dump_filter) < 0) {
> 
> [...]
> static int iproute_dump_filter(struct nlmsghdr *nlh, int reqlen)
> [...]
> 	if (filter.cloned)
> 		rtm->rtm_flags |= RTM_F_CLONED;
> 
> but not on 'flush':
> 
> 		if (rtnl_routedump_req(&rth, family, NULL) < 0) {
> 
> but this doesn't change things much: it still has no way to flush the
> cache, because the dump to get the routes to flush doesn't contain the
> exceptions.
'ip -6 r l table cache' can be limited to dump the cache only, right?

I am still missing something about why the kernel is required
to output everything and then filtered out in the iproute2.

You meant either:
The kernel needs to dump everything first. iproute2 can then figure out
which one is cache and then flush them?
or
the iproute2 can be changed to only get the cache from the kernel and then
flush them?

AFAIK, the kernel has never dumped the cache routes for IPv4.
What is done here has to be consistent with the future patch in IPv4.
Each node can hold up to 5*1024 caches which is ok-ish but still a waste
to dump it and then not printing it.
Stefano Brivio June 10, 2019, 9:01 p.m. UTC | #22
On Mon, 10 Jun 2019 19:42:41 +0000
Martin Lau <kafai@fb.com> wrote:

> On Sat, Jun 08, 2019 at 05:47:07PM +0200, Stefano Brivio wrote:
> > On Sat, 8 Jun 2019 17:02:06 +0200
> > Stefano Brivio <sbrivio@redhat.com> wrote:
> >   
> > > On Sat, 8 Jun 2019 07:19:23 +0000
> > > Martin Lau <kafai@fb.com> wrote:
> > >   
> > > > On Sat, Jun 08, 2019 at 07:59:11AM +0200, Stefano Brivio wrote:    
> > > > > I also agree it makes more sense to filter routes this way.
> > > > > 
> > > > > But it wasn't like this before 2b760fcf5cfb, so this smells like
> > > > > breaking userspace expectations, even though iproute already filters
> > > > > routes this way: with 'cache' it only displays routes with
> > > > > RTM_F_CLONED, without, it won't display exceptions, see filter_nlmsg():      
> > > > Thanks for pointing it out.
> > > >     
> > > > > 	if (filter.cloned == !(r->rtm_flags & RTM_F_CLONED))
> > > > > 		return 0;
> > > > > 
> > > > > This, together with the fact it's been like that for almost two years
> > > > > now, makes it acceptable in my opinion. What do you think?      
> > > > With learning the above fact on iproute2,
> > > > it makes even less sense to dump exceptions from the kernel side
> > > > when RTM_F_CLONED is not set.    
> > > 
> > > I just hit a more fundamental problem though: iproute2 filters on the
> > > flag, but never sets it on a dump request. Flags will be NLM_F_DUMP |
> > > NLM_F_REQUEST, no matter what, see rtnl_routedump_req(). So the current
> > > iproute2 would have no way to dump cached routes.  
> > 
> > Partially wrong: it actually sets it on 'list':
> > 
> > 	if (rtnl_routedump_req(&rth, dump_family, iproute_dump_filter) < 0) {
> > 
> > [...]
> > static int iproute_dump_filter(struct nlmsghdr *nlh, int reqlen)
> > [...]
> > 	if (filter.cloned)
> > 		rtm->rtm_flags |= RTM_F_CLONED;
> > 
> > but not on 'flush':
> > 
> > 		if (rtnl_routedump_req(&rth, family, NULL) < 0) {
> > 
> > but this doesn't change things much: it still has no way to flush the
> > cache, because the dump to get the routes to flush doesn't contain the
> > exceptions.  
> 'ip -6 r l table cache' can be limited to dump the cache only, right?

Yes, at it was in v1 and v2. But that's arbitrary and inconsistent,
because without RTM_F_CLONED we would anyway need (with current
iproute2) to dump exceptions too.

RTM_F_CLONED just isn't a filter right now. Let's add support for
NLM_F_MATCH later and have clear semantics. I think this whole mess
comes from the fact we miss that.

> I am still missing something about why the kernel is required
> to output everything and then filtered out in the iproute2.
> 
> You meant either:
> The kernel needs to dump everything first. iproute2 can then figure out
> which one is cache and then flush them?

Yes, this is the case, right now.

> or
> the iproute2 can be changed to only get the cache from the kernel and then
> flush them?

I didn't mean this. In theory, we could, but I wouldn't fix a kernel
regression with a userspace change. I'm not adding NLM_F_MATCH support
in this fix because it wouldn't be a fix for past and current iproute2
versions, so it wouldn't be a fix.

> AFAIK, the kernel has never dumped the cache routes for IPv4.

Right now, it doesn't, but I can't believe that 'ip route list cache'
was implemented in iproute2 and it never worked. I haven't tested IPv4
with older kernels, though.

> What is done here has to be consistent with the future patch in IPv4.
> Each node can hold up to 5*1024 caches which is ok-ish but still a waste
> to dump it and then not printing it.

Indeed, I agree. For both IPv4 and IPv6, we need to have clear
semantics to avoid this waste. NLM_F_MATCH is specified by RFC 3549
exactly for this purpose, but we don't implement that.

Without NLM_F_MATCH, RTM_F_CLONED is just what the comment in UAPI says:

#define RTM_F_CLONED		0x200	/* This route is cloned		*/

So the future patch for IPv4 would be consistent: without NLM_F_MATCH
we'll dump everything and iproute2 filters (as it was until 2017 at
least for IPv6).

Then I'd add, for both IPv4 and IPv6, support for NLM_F_MATCH (both
for kernel and iproute2): kernel filters, sends NLM_F_DUMP_FILTERED
back and iproute2 doesn't (need to) filter. But this will be a (kind of
needed) optimisation, not a fix.
diff mbox series

Patch

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 008421b550c6..5be133565819 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -581,13 +581,10 @@  static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
 		struct rtmsg *rtm = nlmsg_data(nlh);
 
-		arg.filter.flags = rtm->rtm_flags & (RTM_F_PREFIX|RTM_F_CLONED);
+		if (rtm->rtm_flags & RTM_F_PREFIX)
+			arg.filter.flags = RTM_F_PREFIX;
 	}
 
-	/* fib entries are never clones */
-	if (arg.filter.flags & RTM_F_CLONED)
-		goto out;
-
 	w = (void *)cb->args[2];
 	if (!w) {
 		/* New dump:
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 848e944f07df..51f923b3ad26 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4862,8 +4862,11 @@  int rt6_dump_route(struct fib6_info *rt, void *p_arg)
 {
 	struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg;
 	struct fib_dump_filter *filter = &arg->filter;
+	struct rt6_exception_bucket *bucket;
 	unsigned int flags = NLM_F_MULTI;
+	struct rt6_exception *rt6_ex;
 	struct net *net = arg->net;
+	int i, err;
 
 	if (rt == net->ipv6.fib6_null_entry)
 		return 0;
@@ -4882,9 +4885,38 @@  int rt6_dump_route(struct fib6_info *rt, void *p_arg)
 		flags |= NLM_F_DUMP_FILTERED;
 	}
 
-	return rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
-			     RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid,
-			     arg->cb->nlh->nlmsg_seq, flags);
+	if (!(filter->flags & RTM_F_CLONED)) {
+		err = rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
+				    RTM_NEWROUTE,
+				    NETLINK_CB(arg->cb->skb).portid,
+				    arg->cb->nlh->nlmsg_seq, flags);
+		if (err)
+			return err;
+	} else {
+		flags |= NLM_F_DUMP_FILTERED;
+	}
+
+	bucket = rcu_dereference(rt->rt6i_exception_bucket);
+	if (!bucket)
+		return 0;
+
+	for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
+		hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) {
+			if (rt6_check_expired(rt6_ex->rt6i))
+				continue;
+
+			err = rt6_fill_node(net, arg->skb, rt,
+					    &rt6_ex->rt6i->dst,
+					    NULL, NULL, 0, RTM_NEWROUTE,
+					    NETLINK_CB(arg->cb->skb).portid,
+					    arg->cb->nlh->nlmsg_seq, flags);
+			if (err)
+				return err;
+		}
+		bucket++;
+	}
+
+	return 0;
 }
 
 static int inet6_rtm_valid_getroute_req(struct sk_buff *skb,