diff mbox

[v2,-next] net: fib: use fib result when zero-length prefix aliases exist

Message ID alpine.LFD.2.11.1507180045330.2410@ja.home.ssi.bg
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Julian Anastasov July 17, 2015, 10:47 p.m. UTC
Hello,

On Fri, 17 Jul 2015, Florian Westphal wrote:

> default route selection is not deterministic when TOS keys are used:

	In fact, TOS should be matched just like in
fib_table_lookup but it is not.

> This changes fib_select_default to not change the FIB chosen result EXCEPT
> if this nexthop appears to be unreachable.
> 
> fib_detect_death() logic is reversed -- we consider a nexthop 'dead' only
> if it has a neigh entry in unreachable state.

	The only difference I see is that NUD_NODE is
declared by fib_nud_is_unreach() as reachable. May be
it is better, for example, new route in NUD_NONE state
will be tried for a while, until its state is determined.

> +	if (likely(!fib_nud_is_unreach(res->fi)))
> +		return;

	here first is unreachable...

> +
> +	/* attempt to pick another nexthop */
>  	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
>  		struct fib_info *next_fi = fa->fa_info;
>  
> @@ -1223,38 +1223,30 @@ void fib_select_default(struct fib_result *res)
>  		    next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK)
>  			continue;
>  
> +		order++;
> +
> +		if (next_fi == res->fi) /* already tested, not reachable */
> +			continue;
> +
>  		fib_alias_accessed(fa);
>  
> -		if (!fi) {
> -			if (next_fi != res->fi)
> +		if (fib_nud_is_unreach(next_fi))

			all others are unreachable too...

> +			continue;
> +
> +		/* try to round-robin among all fa_aliases in case
> +		 * res->fi nexthop is unreachable.

	round-robin only among reachables? But original algorithm
can round-robin among unreachables. State will not change
without trying some packets to unreachable dests.

> +		 */
> +		if (fi == NULL || order > tb->tb_default) {
> +			fi = next_fi;
> +			fi_idx = order;
> +			if (order > tb->tb_default)
>  				break;

	1=unreac, 2=reach, 3=tb_default(reach), 4=reach.
We like 2 (fi == NULL), why we continue after 2, skip
default 3 (because order > tb->tb_default is false) and
finally select 4 (order=4 > tb->tb_default=3)?

> +	tb->tb_default = fi_idx;

	And we do not round-robin if all look unreachable?

	I'm not yet sure if fib_detect_death logic should
be changed. What is strange is that we can switch between
two NUD_VALID entries...

	The __neigh_lookup_noref usage looks ok.
Not sure about the NUD_NODE change, may be it is ok to
be added in fib_detect_death. As for fib_select_default,
may be only change like below is needed. It additionally
restricts the alternative routes to same prefix, same TOS.
The TOS restriction was missing from long time but the
prefix check was lost recently when fa_head was changed
to contain aliases from different prefixes.


Regards

--
Julian Anastasov <ja@ssi.bg>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Florian Westphal July 18, 2015, 6:05 p.m. UTC | #1
Julian Anastasov <ja@ssi.bg> wrote:

[ Dave, please toss my patch, its either v3 or something else entirely ]

> 	In fact, TOS should be matched just like in
> fib_table_lookup but it is not.
> 
> > This changes fib_select_default to not change the FIB chosen result EXCEPT
> > if this nexthop appears to be unreachable.
> > 
> > fib_detect_death() logic is reversed -- we consider a nexthop 'dead' only
> > if it has a neigh entry in unreachable state.
> 
> 	The only difference I see is that NUD_NODE is
> declared by fib_nud_is_unreach() as reachable. May be
> it is better, for example, new route in NUD_NONE state
> will be tried for a while, until its state is determined.

fib_detect_death considers neigh_lookup() returning NULL as unreach.

I don't think its good idea and should rather only skip if NUD is
failed or incomplete.

> > +	if (likely(!fib_nud_is_unreach(res->fi)))
> > +		return;
> 
> 	here first is unreachable...
> > +	/* attempt to pick another nexthop */
> >  	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
> >  		struct fib_info *next_fi = fa->fa_info;
> >  
> > @@ -1223,38 +1223,30 @@ void fib_select_default(struct fib_result *res)
> >  		    next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK)
> >  			continue;
> >  
> > +		order++;
> > +
> > +		if (next_fi == res->fi) /* already tested, not reachable */
> > +			continue;
> > +
> >  		fib_alias_accessed(fa);
> >  
> > -		if (!fi) {
> > -			if (next_fi != res->fi)
> > +		if (fib_nud_is_unreach(next_fi))
> 
> 			all others are unreachable too...
> 
> > +			continue;
> > +
> > +		/* try to round-robin among all fa_aliases in case
> > +		 * res->fi nexthop is unreachable.
> 
> 	round-robin only among reachables? But original algorithm
> can round-robin among unreachables. State will not change
> without trying some packets to unreachable dests.

Hmm... thanks for pointing that out.

But thats confusing.  If thats true, why does fib_detect_death
even exist?  If we skip unreachables, they cannot become
reachable again...?

> > +			fi = next_fi;
> > +			fi_idx = order;
> > +			if (order > tb->tb_default)
> >  				break;
> 
> 	1=unreac, 2=reach, 3=tb_default(reach), 4=reach.
> We like 2 (fi == NULL), why we continue after 2, skip
> default 3 (because order > tb->tb_default is false) and
> finally select 4 (order=4 > tb->tb_default=3)?

Hmm.  Let me ask different question.
What is the supposed behaviour in the case you describe?

Removing order > default
test means we'll always pick first reachable in the list, no?

I don't think thats bad, but, how/why is is that 'better'?

> > +	tb->tb_default = fi_idx;
> 
> 	And we do not round-robin if all look unreachable?

Right.  One possible solution might be this patch, but alter
fib_nud_is_unreach() to also check timer_pending(n->timer), i.e.

if (state & (NUD_INCOMPLETE | NUD_FAILED))
	return timer_was_pending ? UNREACHABLE : REACHABLE;

This way we would re-try gw not only if neigh doesn't exist but
also if no retry is pending.

What do you think?

> As for fib_select_default,
> may be only change like below is needed. It additionally
> restricts the alternative routes to same prefix, same TOS.
> The TOS restriction was missing from long time but the
> prefix check was lost recently when fa_head was changed
> to contain aliases from different prefixes.

>  /* Must be invoked inside of an RCU protected region.  */
> -void fib_select_default(struct fib_result *res)
> +void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
>  {
>  	struct fib_info *fi = NULL, *last_resort = NULL;
>  	struct hlist_head *fa_head = res->fa_head;
>  	struct fib_table *tb = res->table;
> +	u8 slen = 32 - res->prefixlen;
>  	int order = -1, last_idx = -1;
>  	struct fib_alias *fa;
>  
>  	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
>  		struct fib_info *next_fi = fa->fa_info;
>  
> +		if (fa->fa_slen != slen)
> +			continue;
>  		if (next_fi->fib_scope != res->scope ||
>  		    fa->fa_type != RTN_UNICAST)
>  			continue;
>  
> +		if (fa->tb_id != tb->tb_id)
> +			continue;
> +		if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos)
> +			continue;

Hmm, this means we won't consider alternative if tos doesn't match,
except 0.

So with
ip route add tos 0x00 via 192.168.1.1
ip route add tos 0x04 via 192.168.1.2
ip route add tos 0x08 via 192.168.1.3
ip route add tos 0x0c via 192.168.1.4
ip route add tos 0x10 via 192.168.1.5

and ping -Q 1.2.3.4

if 1.5 is unreachable, we pick 1.1 (tos 0x0).

Others are not attempted.
Do you consider that 'correct'?  I don't see why
we can use tos 0x00 but not e.g. 0x04 in that case.

If .5 becomes available, we won't attempt it and still use 1.1/tos 0.

Thanks for looking into this.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Julian Anastasov July 19, 2015, 11:43 a.m. UTC | #2
Hello,

On Sat, 18 Jul 2015, Florian Westphal wrote:

> Julian Anastasov <ja@ssi.bg> wrote:
> 
> > 	The only difference I see is that NUD_NODE is
> > declared by fib_nud_is_unreach() as reachable. May be
> > it is better, for example, new route in NUD_NONE state
> > will be tried for a while, until its state is determined.
> 
> fib_detect_death considers neigh_lookup() returning NULL as unreach.

	I mean, NUD_NODE is possible state, for example,
if entry is created by user space for silly reasons.
For example:

ip neigh add $IP dev $DEV nud none
ip neigh list nud none

	It is present and not used yet. Even ip route get
can not trigger neigh resolving, state will remain same.
Only traffic can trigger resolving.

> I don't think its good idea and should rather only skip if NUD is
> failed or incomplete.

	OK, then we are back to NUD_VALID usage.

> > 	round-robin only among reachables? But original algorithm
> > can round-robin among unreachables. State will not change
> > without trying some packets to unreachable dests.
> 
> Hmm... thanks for pointing that out.
> 
> But thats confusing.  If thats true, why does fib_detect_death
> even exist?  If we skip unreachables, they cannot become
> reachable again...?

	No, we return them one by one. If we direct traffic
to some alternative GW its state may switch to reachable. If
this GW is unreachable, the next try will select next alternative
GW. The algorithm tries to probe the unreachable GWs until
the traffic causes successful ARP resolving and selecting
one GW as reachable.

> > 	1=unreac, 2=reach, 3=tb_default(reach), 4=reach.
> > We like 2 (fi == NULL), why we continue after 2, skip
> > default 3 (because order > tb->tb_default is false) and
> > finally select 4 (order=4 > tb->tb_default=3)?
> 
> Hmm.  Let me ask different question.
> What is the supposed behaviour in the case you describe?

	I think, GW 2 should be selected as first reachable,
it should not be skipped.

	I decode fib_detect_death and fib_select_default as follows:

1. only one fi? keep tb_default=-1 no matter the state

2. prefer the first NUD_REACHABLE or NUD_VALID that is not tb_default.
I see the following cases:

2.1. NUD_REACHABLE is before another NUD_VALID => use only the 
NUD_REACHABLE

2.2. NUD_VALID is before another NUD_VALID/NUD_REACHABLE => switch
between them. I don't know the reason for this logic, though.
Note that NUD_REACHABLE is part of the NUD_VALID mask.

3. if there are no other NUD_VALID and the current (tb_default) is
NUD_VALID use it again

4. try all (!NUD_VALID) GWs one by one in a round-robin fashion by
increasing tb_default from -1 to max and then back to -1.
Such round-robin is attempted if there are no NUD_VALID GWs.
This is used after idle periods or if new GWs are added to
direct traffic for ARP probing.

> Removing order > default
> test means we'll always pick first reachable in the list, no?

	First reachable is ok, why not? But fib_detect_death
tries some differentiation between NUD_REACHABLE and
NUD_VALID.

> I don't think thats bad, but, how/why is is that 'better'?
> 
> > > +	tb->tb_default = fi_idx;
> > 
> > 	And we do not round-robin if all look unreachable?
> 
> Right.  One possible solution might be this patch, but alter
> fib_nud_is_unreach() to also check timer_pending(n->timer), i.e.
> 
> if (state & (NUD_INCOMPLETE | NUD_FAILED))
> 	return timer_was_pending ? UNREACHABLE : REACHABLE;

	Such timer_was_pending check is not needed. State
specifies if timer is running.

> This way we would re-try gw not only if neigh doesn't exist but
> also if no retry is pending.

	You mean:

- NUD_FAILED/NUD_NONE: no timer is pending, no current probing,
we can direct traffic that will set NUD_INCOMPLETE and will
send ARP probe

- NUD_INCOMPLETE: GW is currently probed, do not use it for
round-robin, direct traffic to another NUD_FAILED/NUD_NONE GW

	Still, above states should be considered "unreachable",
the difference will be only for the round-robin algorithm
when all GWs are unreachable.

> What do you think?

	But what do you think about above fib_detect_death
logic? I think, your tests fail just because ip route get
does not trigger traffic that will cause ARP resolution and
because fib_select_default is buggy to not check slen and
especially the tos.

> > +		if (fa->tb_id != tb->tb_id)
> > +			continue;
> > +		if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos)
> > +			continue;
> 
> Hmm, this means we won't consider alternative if tos doesn't match,
> except 0.
> 
> So with
> ip route add tos 0x00 via 192.168.1.1
> ip route add tos 0x04 via 192.168.1.2
> ip route add tos 0x08 via 192.168.1.3
> ip route add tos 0x0c via 192.168.1.4
> ip route add tos 0x10 via 192.168.1.5
> 
> and ping -Q 1.2.3.4
> 
> if 1.5 is unreachable, we pick 1.1 (tos 0x0).
> 
> Others are not attempted.

	Yep

> Do you consider that 'correct'?  I don't see why
> we can use tos 0x00 but not e.g. 0x04 in that case.

	It is illegal to return alternative routes that should
be selected for another TOS, 192.168.1.2 should be selected
only for packets with TOS 0x04. And TOS 0x00 in packets can not
be matched by definition, it is used as 'match any TOS' value.

	Also, without TOS restriction the 'if (next_fi != res->fi)'
check will avoid alternatives for routes that have TOS with lower
value, for example:

ip route add tos 0x20 via 1.1.1.1
ip route add tos 0x20 via 2.2.2.2
ip route add tos 0x10 via 3.3.3.3
ip route add tos 0x10 via 4.4.4.4
ip route add tos 0x00 via 5.5.5.5

	fib_select_default from recent kernels always starts
from fa_head pointing to 1.1.1.1, so if res->fi points to the
same place the alternatives for TOS 0x20 will be considered.

	But if res->fi selects 3.3.3.3 for TOS 0x10 the
'if (next_fi != res->fi)' breaks and the 4.4.4.4 alternative
is ignored => tb_default=-1.

	In your example in first email, TOS 0x10 does
round-robin because it is the highest TOS and its alt
routes are considered. Due to the bug in fib_select_default
for lower TOS values no alt GWs are considered. That explains
the behaviour you see.

> If .5 becomes available, we won't attempt it and still use 1.1/tos 0.

	1.5 is before 1.1 in fa_head list, so it has priority
among the reachable/valid routes.

	The list with alternative routes should be:

- all routes with same TOS starting from the res->fi. The
TOS check in my patch will skip routes with higher TOS value
and then with lower TOS value that is not 0.

- if above TOS is not 0 we should consider all routes with TOS 0.
In above example, we have 3 possible lists of alt routes:

- IP.TOS 0x20: 1.1.1.1, 2.2.2.2, 5.5.5.5
- IP.TOS 0x10: 3.3.3.3, 4.4.4.4, 5.5.5.5
- other TOS: 5.5.5.5

	The funny part is that tb->tb_default is one
and can't remember the order for every list. May be
tb_default should be moved to fa (the first one for TOS).
In above example, the first fa is 1.1.1.1, 3.3.3.3
and 5.5.5.5.

	So, I think, there is no big reason to change
fib_detect_death. But fib_select_default needs to filter
by slen and TOS and may be to move tb_default to fa,
in case one wants to use TOS in alt routes.

Regards

--
Julian Anastasov <ja@ssi.bg>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal July 19, 2015, 11:03 p.m. UTC | #3
Julian Anastasov <ja@ssi.bg> wrote:
> ip neigh add $IP dev $DEV nud none
> ip neigh list nud none
> 
> 	It is present and not used yet. Even ip route get
> can not trigger neigh resolving, state will remain same.
> Only traffic can trigger resolving.

Right.

> > > 	round-robin only among reachables? But original algorithm
> > > can round-robin among unreachables. State will not change
> > > without trying some packets to unreachable dests.
> > 
> > Hmm... thanks for pointing that out.
> > 
> > But thats confusing.  If thats true, why does fib_detect_death
> > even exist?  If we skip unreachables, they cannot become
> > reachable again...?
> 
> 	No, we return them one by one. If we direct traffic
> to some alternative GW its state may switch to reachable. If
> this GW is unreachable, the next try will select next alternative
> GW. The algorithm tries to probe the unreachable GWs until
> the traffic causes successful ARP resolving and selecting
> one GW as reachable.

Doesn't work for me (or I misunderstand). See below.

> 	I decode fib_detect_death and fib_select_default as follows:

[Description of algorithm]

> 2. prefer the first NUD_REACHABLE or NUD_VALID that is not tb_default.
> I see the following cases:
> 
> 2.1. NUD_REACHABLE is before another NUD_VALID => use only the 
> NUD_REACHABLE

Yes, and thats what I don't understand.

> 	You mean:
> 
> - NUD_FAILED/NUD_NONE: no timer is pending, no current probing,
> we can direct traffic that will set NUD_INCOMPLETE and will
> send ARP probe
> 
> - NUD_INCOMPLETE: GW is currently probed, do not use it for
> round-robin, direct traffic to another NUD_FAILED/NUD_NONE GW

Yes, thats what I meant.

> 	Still, above states should be considered "unreachable",
> the difference will be only for the round-robin algorithm
> when all GWs are unreachable.

> 	But what do you think about above fib_detect_death
> logic? I think, your tests fail just because ip route get
> does not trigger traffic that will cause ARP resolution and
> because fib_select_default is buggy to not check slen and
> especially the tos.

Mostly, yes.

> > So with
> > ip route add tos 0x00 via 192.168.1.1
> > ip route add tos 0x04 via 192.168.1.2
> > ip route add tos 0x08 via 192.168.1.3
> > ip route add tos 0x0c via 192.168.1.4
> > ip route add tos 0x10 via 192.168.1.5
> > 
> > and ping -Q 1.2.3.4
> > 
> > if 1.5 is unreachable, we pick 1.1 (tos 0x0).
> > 
> > Others are not attempted.
> 
> 	Yep
> 
> > Do you consider that 'correct'?  I don't see why
> > we can use tos 0x00 but not e.g. 0x04 in that case.
> 
> 	It is illegal to return alternative routes that should
> be selected for another TOS, 192.168.1.2 should be selected
> only for packets with TOS 0x04. And TOS 0x00 in packets can not
> be matched by definition, it is used as 'match any TOS' value.

Ok.

> 	Also, without TOS restriction the 'if (next_fi != res->fi)'
> check will avoid alternatives for routes that have TOS with lower
> value, for example:
> 
> ip route add tos 0x20 via 1.1.1.1
> ip route add tos 0x20 via 2.2.2.2
> ip route add tos 0x10 via 3.3.3.3
> ip route add tos 0x10 via 4.4.4.4
> ip route add tos 0x00 via 5.5.5.5
> 
> 	fib_select_default from recent kernels always starts
> from fa_head pointing to 1.1.1.1, so if res->fi points to the
> same place the alternatives for TOS 0x20 will be considered.
> 
> 	But if res->fi selects 3.3.3.3 for TOS 0x10 the
> 'if (next_fi != res->fi)' breaks and the 4.4.4.4 alternative
> is ignored => tb_default=-1.
> 
> 	In your example in first email, TOS 0x10 does
> round-robin because it is the highest TOS and its alt
> routes are considered. Due to the bug in fib_select_default
> for lower TOS values no alt GWs are considered. That explains
> the behaviour you see.

Yes.

> > If .5 becomes available, we won't attempt it and still use 1.1/tos 0.
> 
> 	1.5 is before 1.1 in fa_head list, so it has priority
> among the reachable/valid routes.
> 
> 	The list with alternative routes should be:
> 
> - all routes with same TOS starting from the res->fi. The
> TOS check in my patch will skip routes with higher TOS value
> and then with lower TOS value that is not 0.
> 
> - if above TOS is not 0 we should consider all routes with TOS 0.
> In above example, we have 3 possible lists of alt routes:
> 
> - IP.TOS 0x20: 1.1.1.1, 2.2.2.2, 5.5.5.5
> - IP.TOS 0x10: 3.3.3.3, 4.4.4.4, 5.5.5.5
> - other TOS: 5.5.5.5
> 
> 	The funny part is that tb->tb_default is one
> and can't remember the order for every list. May be
> tb_default should be moved to fa (the first one for TOS).
> In above example, the first fa is 1.1.1.1, 3.3.3.3
> and 5.5.5.5.
> 
> 	So, I think, there is no big reason to change
> fib_detect_death. But fib_select_default needs to filter
> by slen and TOS and may be to move tb_default to fa,
> in case one wants to use TOS in alt routes.

Let me give you an example why I still think that fib_detect_death
is not optimal.

Test VM is running net-next with your proposed patch on top.

Its much better, the weird round-robin behaviour reported is gone.

The VM has two interfaces,
eth0, 192.168.7.10
eth1, 192.168.8.10

ip route del default
ip route add tos 0x0 via 192.168.7.1
ip route add tos 0x10 via 192.168.8.2

7.1 is reachable via eth0 (7.10/24)
8.2 *should* be rechable via eth1 (8.10/24)

I say *should* because I deliberately deleted this address from gateway
connected to that interface.

Now, I run

ping -Q 0x10 192.168.0.7

Packets get sent via tos 0x0.  So far, so good.

Now I add back 192.168.8.2.

But no probe takes place so packets continue to be sent via tos 0 route.

neigh_lookup returns NULL -> state is NUD_NONE -> gw is deemed unreachable

If I ping in the other direction from 192.168.8.2 packets are steered to
the 8.2/tos 0x10 gateway.

ip neigh flush nud reachable
on the test vm also makes packets take the tos 0x10 gateway.

The reverse (.8.2 available, packets sent via tos 0x10 use it, 8.2
address is removed) works: after a few seconds entry moves to STALE,
then FAILED, packets get sent via tos 0x0 gateway.

In any case, I don't want to hold this up, your proposed patch fixes
the TOS-0x10-always-round-robins issue.

Thanks for your detailed analysis & help with this.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 49c142b..0b1af68 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -290,7 +290,7 @@  __be32 fib_compute_spec_dst(struct sk_buff *skb);
 int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 			u8 tos, int oif, struct net_device *dev,
 			struct in_device *idev, u32 *itag);
-void fib_select_default(struct fib_result *res);
+void fib_select_default(const struct flowi4 *flp, struct fib_result *res);
 #ifdef CONFIG_IP_ROUTE_CLASSID
 static inline int fib_num_tclassid_users(struct net *net)
 {
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c7358ea..088b2a5 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1202,21 +1202,29 @@  int fib_sync_down_dev(struct net_device *dev, unsigned long event)
 }
 
 /* Must be invoked inside of an RCU protected region.  */
-void fib_select_default(struct fib_result *res)
+void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
 {
 	struct fib_info *fi = NULL, *last_resort = NULL;
 	struct hlist_head *fa_head = res->fa_head;
 	struct fib_table *tb = res->table;
+	u8 slen = 32 - res->prefixlen;
 	int order = -1, last_idx = -1;
 	struct fib_alias *fa;
 
 	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
 		struct fib_info *next_fi = fa->fa_info;
 
+		if (fa->fa_slen != slen)
+			continue;
 		if (next_fi->fib_scope != res->scope ||
 		    fa->fa_type != RTN_UNICAST)
 			continue;
 
+		if (fa->tb_id != tb->tb_id)
+			continue;
+		if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos)
+			continue;
+
 		if (next_fi->fib_priority > res->fi->fib_priority)
 			break;
 		if (!next_fi->fib_nh[0].nh_gw ||
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index d0362a2..e681b85 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2176,7 +2176,7 @@  struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 	if (!res.prefixlen &&
 	    res.table->tb_num_default > 1 &&
 	    res.type == RTN_UNICAST && !fl4->flowi4_oif)
-		fib_select_default(&res);
+		fib_select_default(fl4, &res);
 
 	if (!fl4->saddr)
 		fl4->saddr = FIB_RES_PREFSRC(net, res);