Message ID | alpine.LFD.2.11.1507180045330.2410@ja.home.ssi.bg |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
Julian Anastasov <ja@ssi.bg> wrote: [ Dave, please toss my patch, its either v3 or something else entirely ] > In fact, TOS should be matched just like in > fib_table_lookup but it is not. > > > This changes fib_select_default to not change the FIB chosen result EXCEPT > > if this nexthop appears to be unreachable. > > > > fib_detect_death() logic is reversed -- we consider a nexthop 'dead' only > > if it has a neigh entry in unreachable state. > > The only difference I see is that NUD_NODE is > declared by fib_nud_is_unreach() as reachable. May be > it is better, for example, new route in NUD_NONE state > will be tried for a while, until its state is determined. fib_detect_death considers neigh_lookup() returning NULL as unreach. I don't think its good idea and should rather only skip if NUD is failed or incomplete. > > + if (likely(!fib_nud_is_unreach(res->fi))) > > + return; > > here first is unreachable... > > + /* attempt to pick another nexthop */ > > hlist_for_each_entry_rcu(fa, fa_head, fa_list) { > > struct fib_info *next_fi = fa->fa_info; > > > > @@ -1223,38 +1223,30 @@ void fib_select_default(struct fib_result *res) > > next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK) > > continue; > > > > + order++; > > + > > + if (next_fi == res->fi) /* already tested, not reachable */ > > + continue; > > + > > fib_alias_accessed(fa); > > > > - if (!fi) { > > - if (next_fi != res->fi) > > + if (fib_nud_is_unreach(next_fi)) > > all others are unreachable too... > > > + continue; > > + > > + /* try to round-robin among all fa_aliases in case > > + * res->fi nexthop is unreachable. > > round-robin only among reachables? But original algorithm > can round-robin among unreachables. State will not change > without trying some packets to unreachable dests. Hmm... thanks for pointing that out. But thats confusing. If thats true, why does fib_detect_death even exist? If we skip unreachables, they cannot become reachable again...? > > + fi = next_fi; > > + fi_idx = order; > > + if (order > tb->tb_default) > > break; > > 1=unreac, 2=reach, 3=tb_default(reach), 4=reach. > We like 2 (fi == NULL), why we continue after 2, skip > default 3 (because order > tb->tb_default is false) and > finally select 4 (order=4 > tb->tb_default=3)? Hmm. Let me ask different question. What is the supposed behaviour in the case you describe? Removing order > default test means we'll always pick first reachable in the list, no? I don't think thats bad, but, how/why is is that 'better'? > > + tb->tb_default = fi_idx; > > And we do not round-robin if all look unreachable? Right. One possible solution might be this patch, but alter fib_nud_is_unreach() to also check timer_pending(n->timer), i.e. if (state & (NUD_INCOMPLETE | NUD_FAILED)) return timer_was_pending ? UNREACHABLE : REACHABLE; This way we would re-try gw not only if neigh doesn't exist but also if no retry is pending. What do you think? > As for fib_select_default, > may be only change like below is needed. It additionally > restricts the alternative routes to same prefix, same TOS. > The TOS restriction was missing from long time but the > prefix check was lost recently when fa_head was changed > to contain aliases from different prefixes. > /* Must be invoked inside of an RCU protected region. */ > -void fib_select_default(struct fib_result *res) > +void fib_select_default(const struct flowi4 *flp, struct fib_result *res) > { > struct fib_info *fi = NULL, *last_resort = NULL; > struct hlist_head *fa_head = res->fa_head; > struct fib_table *tb = res->table; > + u8 slen = 32 - res->prefixlen; > int order = -1, last_idx = -1; > struct fib_alias *fa; > > hlist_for_each_entry_rcu(fa, fa_head, fa_list) { > struct fib_info *next_fi = fa->fa_info; > > + if (fa->fa_slen != slen) > + continue; > if (next_fi->fib_scope != res->scope || > fa->fa_type != RTN_UNICAST) > continue; > > + if (fa->tb_id != tb->tb_id) > + continue; > + if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos) > + continue; Hmm, this means we won't consider alternative if tos doesn't match, except 0. So with ip route add tos 0x00 via 192.168.1.1 ip route add tos 0x04 via 192.168.1.2 ip route add tos 0x08 via 192.168.1.3 ip route add tos 0x0c via 192.168.1.4 ip route add tos 0x10 via 192.168.1.5 and ping -Q 1.2.3.4 if 1.5 is unreachable, we pick 1.1 (tos 0x0). Others are not attempted. Do you consider that 'correct'? I don't see why we can use tos 0x00 but not e.g. 0x04 in that case. If .5 becomes available, we won't attempt it and still use 1.1/tos 0. Thanks for looking into this. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, On Sat, 18 Jul 2015, Florian Westphal wrote: > Julian Anastasov <ja@ssi.bg> wrote: > > > The only difference I see is that NUD_NODE is > > declared by fib_nud_is_unreach() as reachable. May be > > it is better, for example, new route in NUD_NONE state > > will be tried for a while, until its state is determined. > > fib_detect_death considers neigh_lookup() returning NULL as unreach. I mean, NUD_NODE is possible state, for example, if entry is created by user space for silly reasons. For example: ip neigh add $IP dev $DEV nud none ip neigh list nud none It is present and not used yet. Even ip route get can not trigger neigh resolving, state will remain same. Only traffic can trigger resolving. > I don't think its good idea and should rather only skip if NUD is > failed or incomplete. OK, then we are back to NUD_VALID usage. > > round-robin only among reachables? But original algorithm > > can round-robin among unreachables. State will not change > > without trying some packets to unreachable dests. > > Hmm... thanks for pointing that out. > > But thats confusing. If thats true, why does fib_detect_death > even exist? If we skip unreachables, they cannot become > reachable again...? No, we return them one by one. If we direct traffic to some alternative GW its state may switch to reachable. If this GW is unreachable, the next try will select next alternative GW. The algorithm tries to probe the unreachable GWs until the traffic causes successful ARP resolving and selecting one GW as reachable. > > 1=unreac, 2=reach, 3=tb_default(reach), 4=reach. > > We like 2 (fi == NULL), why we continue after 2, skip > > default 3 (because order > tb->tb_default is false) and > > finally select 4 (order=4 > tb->tb_default=3)? > > Hmm. Let me ask different question. > What is the supposed behaviour in the case you describe? I think, GW 2 should be selected as first reachable, it should not be skipped. I decode fib_detect_death and fib_select_default as follows: 1. only one fi? keep tb_default=-1 no matter the state 2. prefer the first NUD_REACHABLE or NUD_VALID that is not tb_default. I see the following cases: 2.1. NUD_REACHABLE is before another NUD_VALID => use only the NUD_REACHABLE 2.2. NUD_VALID is before another NUD_VALID/NUD_REACHABLE => switch between them. I don't know the reason for this logic, though. Note that NUD_REACHABLE is part of the NUD_VALID mask. 3. if there are no other NUD_VALID and the current (tb_default) is NUD_VALID use it again 4. try all (!NUD_VALID) GWs one by one in a round-robin fashion by increasing tb_default from -1 to max and then back to -1. Such round-robin is attempted if there are no NUD_VALID GWs. This is used after idle periods or if new GWs are added to direct traffic for ARP probing. > Removing order > default > test means we'll always pick first reachable in the list, no? First reachable is ok, why not? But fib_detect_death tries some differentiation between NUD_REACHABLE and NUD_VALID. > I don't think thats bad, but, how/why is is that 'better'? > > > > + tb->tb_default = fi_idx; > > > > And we do not round-robin if all look unreachable? > > Right. One possible solution might be this patch, but alter > fib_nud_is_unreach() to also check timer_pending(n->timer), i.e. > > if (state & (NUD_INCOMPLETE | NUD_FAILED)) > return timer_was_pending ? UNREACHABLE : REACHABLE; Such timer_was_pending check is not needed. State specifies if timer is running. > This way we would re-try gw not only if neigh doesn't exist but > also if no retry is pending. You mean: - NUD_FAILED/NUD_NONE: no timer is pending, no current probing, we can direct traffic that will set NUD_INCOMPLETE and will send ARP probe - NUD_INCOMPLETE: GW is currently probed, do not use it for round-robin, direct traffic to another NUD_FAILED/NUD_NONE GW Still, above states should be considered "unreachable", the difference will be only for the round-robin algorithm when all GWs are unreachable. > What do you think? But what do you think about above fib_detect_death logic? I think, your tests fail just because ip route get does not trigger traffic that will cause ARP resolution and because fib_select_default is buggy to not check slen and especially the tos. > > + if (fa->tb_id != tb->tb_id) > > + continue; > > + if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos) > > + continue; > > Hmm, this means we won't consider alternative if tos doesn't match, > except 0. > > So with > ip route add tos 0x00 via 192.168.1.1 > ip route add tos 0x04 via 192.168.1.2 > ip route add tos 0x08 via 192.168.1.3 > ip route add tos 0x0c via 192.168.1.4 > ip route add tos 0x10 via 192.168.1.5 > > and ping -Q 1.2.3.4 > > if 1.5 is unreachable, we pick 1.1 (tos 0x0). > > Others are not attempted. Yep > Do you consider that 'correct'? I don't see why > we can use tos 0x00 but not e.g. 0x04 in that case. It is illegal to return alternative routes that should be selected for another TOS, 192.168.1.2 should be selected only for packets with TOS 0x04. And TOS 0x00 in packets can not be matched by definition, it is used as 'match any TOS' value. Also, without TOS restriction the 'if (next_fi != res->fi)' check will avoid alternatives for routes that have TOS with lower value, for example: ip route add tos 0x20 via 1.1.1.1 ip route add tos 0x20 via 2.2.2.2 ip route add tos 0x10 via 3.3.3.3 ip route add tos 0x10 via 4.4.4.4 ip route add tos 0x00 via 5.5.5.5 fib_select_default from recent kernels always starts from fa_head pointing to 1.1.1.1, so if res->fi points to the same place the alternatives for TOS 0x20 will be considered. But if res->fi selects 3.3.3.3 for TOS 0x10 the 'if (next_fi != res->fi)' breaks and the 4.4.4.4 alternative is ignored => tb_default=-1. In your example in first email, TOS 0x10 does round-robin because it is the highest TOS and its alt routes are considered. Due to the bug in fib_select_default for lower TOS values no alt GWs are considered. That explains the behaviour you see. > If .5 becomes available, we won't attempt it and still use 1.1/tos 0. 1.5 is before 1.1 in fa_head list, so it has priority among the reachable/valid routes. The list with alternative routes should be: - all routes with same TOS starting from the res->fi. The TOS check in my patch will skip routes with higher TOS value and then with lower TOS value that is not 0. - if above TOS is not 0 we should consider all routes with TOS 0. In above example, we have 3 possible lists of alt routes: - IP.TOS 0x20: 1.1.1.1, 2.2.2.2, 5.5.5.5 - IP.TOS 0x10: 3.3.3.3, 4.4.4.4, 5.5.5.5 - other TOS: 5.5.5.5 The funny part is that tb->tb_default is one and can't remember the order for every list. May be tb_default should be moved to fa (the first one for TOS). In above example, the first fa is 1.1.1.1, 3.3.3.3 and 5.5.5.5. So, I think, there is no big reason to change fib_detect_death. But fib_select_default needs to filter by slen and TOS and may be to move tb_default to fa, in case one wants to use TOS in alt routes. Regards -- Julian Anastasov <ja@ssi.bg> -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Julian Anastasov <ja@ssi.bg> wrote: > ip neigh add $IP dev $DEV nud none > ip neigh list nud none > > It is present and not used yet. Even ip route get > can not trigger neigh resolving, state will remain same. > Only traffic can trigger resolving. Right. > > > round-robin only among reachables? But original algorithm > > > can round-robin among unreachables. State will not change > > > without trying some packets to unreachable dests. > > > > Hmm... thanks for pointing that out. > > > > But thats confusing. If thats true, why does fib_detect_death > > even exist? If we skip unreachables, they cannot become > > reachable again...? > > No, we return them one by one. If we direct traffic > to some alternative GW its state may switch to reachable. If > this GW is unreachable, the next try will select next alternative > GW. The algorithm tries to probe the unreachable GWs until > the traffic causes successful ARP resolving and selecting > one GW as reachable. Doesn't work for me (or I misunderstand). See below. > I decode fib_detect_death and fib_select_default as follows: [Description of algorithm] > 2. prefer the first NUD_REACHABLE or NUD_VALID that is not tb_default. > I see the following cases: > > 2.1. NUD_REACHABLE is before another NUD_VALID => use only the > NUD_REACHABLE Yes, and thats what I don't understand. > You mean: > > - NUD_FAILED/NUD_NONE: no timer is pending, no current probing, > we can direct traffic that will set NUD_INCOMPLETE and will > send ARP probe > > - NUD_INCOMPLETE: GW is currently probed, do not use it for > round-robin, direct traffic to another NUD_FAILED/NUD_NONE GW Yes, thats what I meant. > Still, above states should be considered "unreachable", > the difference will be only for the round-robin algorithm > when all GWs are unreachable. > But what do you think about above fib_detect_death > logic? I think, your tests fail just because ip route get > does not trigger traffic that will cause ARP resolution and > because fib_select_default is buggy to not check slen and > especially the tos. Mostly, yes. > > So with > > ip route add tos 0x00 via 192.168.1.1 > > ip route add tos 0x04 via 192.168.1.2 > > ip route add tos 0x08 via 192.168.1.3 > > ip route add tos 0x0c via 192.168.1.4 > > ip route add tos 0x10 via 192.168.1.5 > > > > and ping -Q 1.2.3.4 > > > > if 1.5 is unreachable, we pick 1.1 (tos 0x0). > > > > Others are not attempted. > > Yep > > > Do you consider that 'correct'? I don't see why > > we can use tos 0x00 but not e.g. 0x04 in that case. > > It is illegal to return alternative routes that should > be selected for another TOS, 192.168.1.2 should be selected > only for packets with TOS 0x04. And TOS 0x00 in packets can not > be matched by definition, it is used as 'match any TOS' value. Ok. > Also, without TOS restriction the 'if (next_fi != res->fi)' > check will avoid alternatives for routes that have TOS with lower > value, for example: > > ip route add tos 0x20 via 1.1.1.1 > ip route add tos 0x20 via 2.2.2.2 > ip route add tos 0x10 via 3.3.3.3 > ip route add tos 0x10 via 4.4.4.4 > ip route add tos 0x00 via 5.5.5.5 > > fib_select_default from recent kernels always starts > from fa_head pointing to 1.1.1.1, so if res->fi points to the > same place the alternatives for TOS 0x20 will be considered. > > But if res->fi selects 3.3.3.3 for TOS 0x10 the > 'if (next_fi != res->fi)' breaks and the 4.4.4.4 alternative > is ignored => tb_default=-1. > > In your example in first email, TOS 0x10 does > round-robin because it is the highest TOS and its alt > routes are considered. Due to the bug in fib_select_default > for lower TOS values no alt GWs are considered. That explains > the behaviour you see. Yes. > > If .5 becomes available, we won't attempt it and still use 1.1/tos 0. > > 1.5 is before 1.1 in fa_head list, so it has priority > among the reachable/valid routes. > > The list with alternative routes should be: > > - all routes with same TOS starting from the res->fi. The > TOS check in my patch will skip routes with higher TOS value > and then with lower TOS value that is not 0. > > - if above TOS is not 0 we should consider all routes with TOS 0. > In above example, we have 3 possible lists of alt routes: > > - IP.TOS 0x20: 1.1.1.1, 2.2.2.2, 5.5.5.5 > - IP.TOS 0x10: 3.3.3.3, 4.4.4.4, 5.5.5.5 > - other TOS: 5.5.5.5 > > The funny part is that tb->tb_default is one > and can't remember the order for every list. May be > tb_default should be moved to fa (the first one for TOS). > In above example, the first fa is 1.1.1.1, 3.3.3.3 > and 5.5.5.5. > > So, I think, there is no big reason to change > fib_detect_death. But fib_select_default needs to filter > by slen and TOS and may be to move tb_default to fa, > in case one wants to use TOS in alt routes. Let me give you an example why I still think that fib_detect_death is not optimal. Test VM is running net-next with your proposed patch on top. Its much better, the weird round-robin behaviour reported is gone. The VM has two interfaces, eth0, 192.168.7.10 eth1, 192.168.8.10 ip route del default ip route add tos 0x0 via 192.168.7.1 ip route add tos 0x10 via 192.168.8.2 7.1 is reachable via eth0 (7.10/24) 8.2 *should* be rechable via eth1 (8.10/24) I say *should* because I deliberately deleted this address from gateway connected to that interface. Now, I run ping -Q 0x10 192.168.0.7 Packets get sent via tos 0x0. So far, so good. Now I add back 192.168.8.2. But no probe takes place so packets continue to be sent via tos 0 route. neigh_lookup returns NULL -> state is NUD_NONE -> gw is deemed unreachable If I ping in the other direction from 192.168.8.2 packets are steered to the 8.2/tos 0x10 gateway. ip neigh flush nud reachable on the test vm also makes packets take the tos 0x10 gateway. The reverse (.8.2 available, packets sent via tos 0x10 use it, 8.2 address is removed) works: after a few seconds entry moves to STALE, then FAILED, packets get sent via tos 0x0 gateway. In any case, I don't want to hold this up, your proposed patch fixes the TOS-0x10-always-round-robins issue. Thanks for your detailed analysis & help with this. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 49c142b..0b1af68 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -290,7 +290,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb); int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, u8 tos, int oif, struct net_device *dev, struct in_device *idev, u32 *itag); -void fib_select_default(struct fib_result *res); +void fib_select_default(const struct flowi4 *flp, struct fib_result *res); #ifdef CONFIG_IP_ROUTE_CLASSID static inline int fib_num_tclassid_users(struct net *net) { diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index c7358ea..088b2a5 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -1202,21 +1202,29 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event) } /* Must be invoked inside of an RCU protected region. */ -void fib_select_default(struct fib_result *res) +void fib_select_default(const struct flowi4 *flp, struct fib_result *res) { struct fib_info *fi = NULL, *last_resort = NULL; struct hlist_head *fa_head = res->fa_head; struct fib_table *tb = res->table; + u8 slen = 32 - res->prefixlen; int order = -1, last_idx = -1; struct fib_alias *fa; hlist_for_each_entry_rcu(fa, fa_head, fa_list) { struct fib_info *next_fi = fa->fa_info; + if (fa->fa_slen != slen) + continue; if (next_fi->fib_scope != res->scope || fa->fa_type != RTN_UNICAST) continue; + if (fa->tb_id != tb->tb_id) + continue; + if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos) + continue; + if (next_fi->fib_priority > res->fi->fib_priority) break; if (!next_fi->fib_nh[0].nh_gw || diff --git a/net/ipv4/route.c b/net/ipv4/route.c index d0362a2..e681b85 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2176,7 +2176,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4) if (!res.prefixlen && res.table->tb_num_default > 1 && res.type == RTN_UNICAST && !fl4->flowi4_oif) - fib_select_default(&res); + fib_select_default(fl4, &res); if (!fl4->saddr) fl4->saddr = FIB_RES_PREFSRC(net, res);