[net] ipv4: Avoid caching dsts when lookup skipped nh oif check

Message ID 1492693132-4708-1-git-send-email-rshearma@brocade.com
State Superseded
Delegated to: David Miller
Headers show

Commit Message

Robert Shearman April 20, 2017, 12:58 p.m.
David reported that doing the following:

    ip li add red type vrf table 10
    ip link set dev eth1 vrf red
    ip addr add 127.0.0.1/8 dev red
    ip link set dev eth1 up
    ip li set red up
    ping -c1 -w1 -I red 127.0.0.1
    ip li del red

results in a hang with this message:

    unregister_netdevice: waiting for red to become free. Usage count = 1

The problem is caused by caching the dst used for sending the packet
out of the specified interface on the route that the lookup returned
from the local table when the rule for the lookup in the local table
is ordered before the rule for lookups using l3mdevs. Thus the dst
could stay around until the route in the local table is deleted which
may be never.

Address the problem by not allocating a cacheable output dst if
FLOWI_FLAG_SKIP_NH_OIF is set and the nh device differs from the
device used for the dst.

Fixes: ebfc102c566d ("net: vrf: Flip IPv4 output path from FIB lookup hook to out hook")
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
---
 net/ipv4/route.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

David Ahern April 20, 2017, 2:21 p.m. | #1
On 4/20/17 6:58 AM, Robert Shearman wrote:
> David reported that doing the following:
> 
>     ip li add red type vrf table 10
>     ip link set dev eth1 vrf red
>     ip addr add 127.0.0.1/8 dev red
>     ip link set dev eth1 up
>     ip li set red up
>     ping -c1 -w1 -I red 127.0.0.1
>     ip li del red
> 
> results in a hang with this message:
> 
>     unregister_netdevice: waiting for red to become free. Usage count = 1

I think you misunderstood my comment. The above works fine today. There
is no bug with refcnts.

It breaks with your patches wanting to use a VRF device with the main
table (254).
Robert Shearman April 20, 2017, 2:39 p.m. | #2
On 20/04/17 15:21, David Ahern wrote:
> On 4/20/17 6:58 AM, Robert Shearman wrote:
>> David reported that doing the following:
>>
>>     ip li add red type vrf table 10
>>     ip link set dev eth1 vrf red
>>     ip addr add 127.0.0.1/8 dev red
>>     ip link set dev eth1 up
>>     ip li set red up
>>     ping -c1 -w1 -I red 127.0.0.1
>>     ip li del red
>>
>> results in a hang with this message:
>>
>>     unregister_netdevice: waiting for red to become free. Usage count = 1
>
> I think you misunderstood my comment. The above works fine today. There
> is no bug with refcnts.
>
> It breaks with your patches wanting to use a VRF device with the main
> table (254).

That doesn't seem to match with my experience. I can reproduce this on 
the net tree with the listed commands and the behaviour matches what I 
see in the code.

Thanks,
Rob
David Ahern April 20, 2017, 2:59 p.m. | #3
On 4/20/17 8:39 AM, Robert Shearman wrote:
> On 20/04/17 15:21, David Ahern wrote:
>> On 4/20/17 6:58 AM, Robert Shearman wrote:
>>> David reported that doing the following:
>>>
>>>     ip li add red type vrf table 10
>>>     ip link set dev eth1 vrf red
>>>     ip addr add 127.0.0.1/8 dev red
>>>     ip link set dev eth1 up
>>>     ip li set red up
>>>     ping -c1 -w1 -I red 127.0.0.1
>>>     ip li del red
>>>
>>> results in a hang with this message:
>>>
>>>     unregister_netdevice: waiting for red to become free. Usage count
>>> = 1
>>
>> I think you misunderstood my comment. The above works fine today. There
>> is no bug with refcnts.
>>
>> It breaks with your patches wanting to use a VRF device with the main
>> table (254).
> 
> That doesn't seem to match with my experience. I can reproduce this on
> the net tree with the listed commands and the behaviour matches what I
> see in the code.

Our mileage varies:

root@kenny-jessie3:~# ip li add red type vrf table 10
root@kenny-jessie3:~#     ip link set dev eth1 vrf red
root@kenny-jessie3:~#     ip addr add 127.0.0.1/8 dev red
root@kenny-jessie3:~#     ip link set dev eth1 up
root@kenny-jessie3:~#     ip li set red up
root@kenny-jessie3:~#     ping -c1 -w1 -I red 127.0.0.1
PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.059 ms

--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.059/0.059/0.059/0.000 ms
root@kenny-jessie3:~#     ip li del red
root@kenny-jessie3:~# uname -a
Linux kenny-jessie3 4.11.0-rc6+ #8 SMP Wed Apr 19 11:53:48 PDT 2017
x86_64 GNU/Linux

The above is one of many tests I run and never hit a problem deleting a
VRF device. dst's attached to fnhe_rth_output and fnhe_rth_input appear
to be properly flushed and device references released when the device is
deleted (NETDEV_DOWN and then NETDEV_UNREGISTER).

Can you send me your kernel config and "sysctl -a --pattern 'net.ipv4'"?
Perhaps you have something enabled I don't.
Robert Shearman April 20, 2017, 3:05 p.m. | #4
On 20/04/17 15:59, David Ahern wrote:
> On 4/20/17 8:39 AM, Robert Shearman wrote:
>> On 20/04/17 15:21, David Ahern wrote:
>>> On 4/20/17 6:58 AM, Robert Shearman wrote:
>>>> David reported that doing the following:
>>>>
>>>>     ip li add red type vrf table 10
>>>>     ip link set dev eth1 vrf red
>>>>     ip addr add 127.0.0.1/8 dev red
>>>>     ip link set dev eth1 up
>>>>     ip li set red up
>>>>     ping -c1 -w1 -I red 127.0.0.1
>>>>     ip li del red
>>>>
>>>> results in a hang with this message:
>>>>
>>>>     unregister_netdevice: waiting for red to become free. Usage count
>>>> = 1
>>>
>>> I think you misunderstood my comment. The above works fine today. There
>>> is no bug with refcnts.
>>>
>>> It breaks with your patches wanting to use a VRF device with the main
>>> table (254).
>>
>> That doesn't seem to match with my experience. I can reproduce this on
>> the net tree with the listed commands and the behaviour matches what I
>> see in the code.
>
> Our mileage varies:
>
> root@kenny-jessie3:~# ip li add red type vrf table 10
> root@kenny-jessie3:~#     ip link set dev eth1 vrf red
> root@kenny-jessie3:~#     ip addr add 127.0.0.1/8 dev red
> root@kenny-jessie3:~#     ip link set dev eth1 up
> root@kenny-jessie3:~#     ip li set red up
> root@kenny-jessie3:~#     ping -c1 -w1 -I red 127.0.0.1
> PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data.
> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.059 ms
>
> --- 127.0.0.1 ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 0.059/0.059/0.059/0.000 ms
> root@kenny-jessie3:~#     ip li del red
> root@kenny-jessie3:~# uname -a
> Linux kenny-jessie3 4.11.0-rc6+ #8 SMP Wed Apr 19 11:53:48 PDT 2017
> x86_64 GNU/Linux
>
> The above is one of many tests I run and never hit a problem deleting a
> VRF device. dst's attached to fnhe_rth_output and fnhe_rth_input appear
> to be properly flushed and device references released when the device is
> deleted (NETDEV_DOWN and then NETDEV_UNREGISTER).
>
> Can you send me your kernel config and "sysctl -a --pattern 'net.ipv4'"?
> Perhaps you have something enabled I don't.

The key thing I think is the ip rules:

$ ip rule
0:	from all lookup local
1000:	from all lookup [l3mdev-table]
32766:	from all lookup main
32767:	from all lookup default

Maybe you have the local rule moved down at startup?

I'll send you the requested information if not.

Thanks,
Rob
David Ahern April 20, 2017, 3:16 p.m. | #5
On 4/20/17 9:05 AM, Robert Shearman wrote:
> The key thing I think is the ip rules:
> 
> $ ip rule
> 0:    from all lookup local
> 1000:    from all lookup [l3mdev-table]
> 32766:    from all lookup main
> 32767:    from all lookup default
> 
> Maybe you have the local rule moved down at startup?

yes that would be a problem. With this test the 127.0.0.1 lookup is
matching on the wrong table:

perf probe fib_table_lookup%return ret=%ax
perf record -e fib:*,probe:* -a -- ping -c1 -w1 -I red 127.0.0.1
...
perf script

            ping  2803 [000] 70559.086446: fib:fib_table_lookup: table
255 oif 31 iif 1 src 0.0.0.0 dst 127.0.0.1 tos 0 scope 0 flags 4
            ping  2803 [000] 70559.086451: fib:fib_table_lookup_nh:
nexthop dev lo oif 1 src 127.0.0.1
            ping  2803 [000] 70559.086453: probe:fib_table_lookup:
(ffffffff8146aaaa <- ffffffff81470734) ret=0x0
            ping  2803 [000] 70559.086458: fib:fib_table_lookup: table
255 oif 31 iif 1 src 127.0.0.1 dst 127.0.0.1 tos 0 scope 0 flags 4
            ping  2803 [000] 70559.086459: fib:fib_table_lookup_nh:
nexthop dev lo oif 1 src 127.0.0.1
            ping  2803 [000] 70559.086460: probe:fib_table_lookup:
(ffffffff8146aaaa <- ffffffff81470734) ret=0x0
            ping  2803 [000] 70559.086752: fib:fib_table_lookup: table
255 oif 31 iif 1 src 0.0.0.0 dst 127.0.0.1 tos 0 scope 0 flags 4
            ping  2803 [000] 70559.086754: fib:fib_table_lookup_nh:
nexthop dev lo oif 1 src 127.0.0.1
            ping  2803 [000] 70559.086755: probe:fib_table_lookup:
(ffffffff8146aaaa <- ffffffff81470734) ret=0x0
            ping  2803 [000] 70559.086768: fib:fib_table_lookup: table
255 oif 31 iif 1 src 127.0.0.1 dst 127.0.0.1 tos 0 scope 0 flags 5
            ping  2803 [000] 70559.086769: fib:fib_table_lookup_nh:
nexthop dev lo oif 1 src 127.0.0.1
            ping  2803 [000] 70559.086770: probe:fib_table_lookup:
(ffffffff8146aaaa <- ffffffff81470734) ret=0x0
            ping  2803 [000] 70559.086781: fib:fib_table_lookup: table
255 oif 31 iif 1 src 127.0.0.1 dst 127.0.0.1 tos 0 scope 0 flags 4
            ping  2803 [000] 70559.086782: fib:fib_table_lookup_nh:
nexthop dev lo oif 1 src 127.0.0.1
            ping  2803 [000] 70559.086782: probe:fib_table_lookup:
(ffffffff8146aaaa <- ffffffff81470734) ret=0x0
            ping  2803 [000] 70559.086787: fib:fib_table_lookup: table
255 oif 31 iif 1 src 127.0.0.1 dst 127.0.0.1 tos 0 scope 0 flags 5
            ping  2803 [000] 70559.086788: fib:fib_table_lookup_nh:
nexthop dev lo oif 1 src 127.0.0.1
            ping  2803 [000] 70559.086789: probe:fib_table_lookup:
(ffffffff8146aaaa <- ffffffff81470734) ret=0x0

Table 255 is the wrong table. That is the ultimate problem here.
Robert Shearman April 20, 2017, 3:35 p.m. | #6
On 20/04/17 16:16, David Ahern wrote:
> On 4/20/17 9:05 AM, Robert Shearman wrote:
>> The key thing I think is the ip rules:
>>
>> $ ip rule
>> 0:    from all lookup local
>> 1000:    from all lookup [l3mdev-table]
>> 32766:    from all lookup main
>> 32767:    from all lookup default
>>
>> Maybe you have the local rule moved down at startup?
>
> yes that would be a problem. With this test the 127.0.0.1 lookup is
> matching on the wrong table:
>
> perf probe fib_table_lookup%return ret=%ax
> perf record -e fib:*,probe:* -a -- ping -c1 -w1 -I red 127.0.0.1
> ...
> perf script
>
...
> Table 255 is the wrong table. That is the ultimate problem here.
>

The table used for the lookup could be retrieved from the oif, but the 
check I introduced would still be required to cope with the unusual, but 
not disallowed, case of two VRF master devices referring to the same table.

Thanks,
Rob

Patch

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index acd69cfe2951..f667783ffd19 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2125,6 +2125,14 @@  static struct rtable *__mkroute_output(const struct fib_result *res,
 		fi = NULL;
 	}
 
+	/* If the flag to skip the nh oif check is set then the output
+	 * device may not match the nh device, so cannot use or add to
+	 * cache in that case.
+	 */
+	if (unlikely(fl4->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF &&
+		     FIB_RES_NH(*res).nh_dev != dev_out))
+		do_cache = false;
+
 	fnhe = NULL;
 	do_cache &= fi != NULL;
 	if (do_cache) {