[RFC,net-next,3/3] rcv path changes for vrf traffic

Message ID 1433793517.4616.4.camel@stressinduktion.org
State RFC, archived
Delegated to: David Miller

Commit Message

Hannes Frederic Sowa June 8, 2015, 7:58 p.m. UTC
Hi Shrijeet,

On Mo, 2015-06-08 at 11:35 -0700, Shrijeet Mukherjee wrote:
> From: Shrijeet Mukherjee <shm@cumulusnetworks.com>
> 
> Incoming frames for IP protocol stacks need the IIF to be changed
> from the actual interface to the VRF device. This allows the IIF
> rule to be used to select tables (or do regular PBR)
> 
> This change selects the iif to be the VRF device if it exists and
> the incoming iif is enslaved to the VRF device.
> 
> Since VRF aware sockets are always bound to the VRF device this
> system allows return traffic to find the socket of origin.
> 
> changes are in the arp_rcv, icmp_rcv and ip_rcv paths
> 
> Question : I did not wrap the rcv modifications, in CONFIG_NET_VRF
> as it would create code variations and the vrf_ptr check is there
> I can make that whole thing modular.

From an architectural point of view I think the output path looks good.
For the input (rx) path I would also like to propose my (I think) more
flexible solution:

[PATCH net-next RFC] net: ipv4: arp: strong end system model semantics by per-interface local table override

By allowing routing table lookups to be directed to a specific table
based on the incoming interface for IPv4 and ARP, we start to behave like
a strong end host system without tweaking the arp_* sysctl settings.

The main motivation behind this patch was input and forwarding support
in a VRF-like model. It may also help hardware offloading by reducing
rule complexity.

An example:

$ ip rule flush
$ ip rule del
$ ip rule del
$ ip rule add inherit-table
0:      from all inherit-table

By default this still uses RT_TABLE_LOCAL until we set up per-interface
route tables:

$ ip link set dev enp0s25 ipv4-rt-table-id 100
$ ip -d link ls dev enp0s25
2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether e4:7f:b2:1b:4c:61 brd ff:ff:ff:ff:ff:ff promiscuity 0 ipv4-rt-table-id 100 addrgenmode none

This lets incoming packets and ARP requests use routing table 100. The
system will stop responding to ARP requests because this routing table
does not have any entries yet.

$ ip address add 192.168.88.223/24 dev enp0s25 table 100
$ ip -d address ls
2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether e4:7f:b2:1b:4c:61 brd ff:ff:ff:ff:ff:ff promiscuity 0
    inet 192.168.88.223/24 scope global enp0s25 table 100
       valid_lft forever preferred_lft forever
$ ip route add 192.168.88.0/24 dev enp0s25 table 100
$ ip route add default via 192.168.88.1 table 100
$ ip route ls dev table 100
local 192.168.88.223 dev enp0s25  proto kernel  scope host  src 192.168.88.223
192.168.88.0/24 dev enp0s25  scope link
default via 192.168.88.1 dev enp0s25 proto static metric 600

These changes direct ARP lookups and the input route lookup towards
table 100. The local address is used for ICMP source addresses and ARP
replies. The connected route steers ICMP packets out of that interface.

This patch covers only the forwarding path.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/linux/inetdevice.h        | 19 ++++++++++++++++---
 include/net/flow.h                |  2 ++
 include/uapi/linux/fib_rules.h    |  1 +
 include/uapi/linux/if_addr.h      |  1 +
 include/uapi/linux/if_link.h      |  1 +
 net/core/fib_rules.c              | 12 +++++++++---
 net/ipv4/devinet.c                | 18 +++++++++++++++++-
 net/ipv4/fib_frontend.c           | 11 +++++++++--
 net/ipv4/fib_rules.c              |  7 ++++++-
 net/ipv4/fib_semantics.c          |  4 +++-
 net/ipv4/icmp.c                   |  1 +
 net/ipv4/netfilter/ipt_rpfilter.c |  1 +
 net/ipv4/route.c                  |  1 +
 13 files changed, 68 insertions(+), 11 deletions(-)
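
The table selection described in the commit message can be sketched in
user-space C. This is only an illustration under stated assumptions, not
the patch itself: the demo_* names are hypothetical stand-ins for the
kernel structures and helpers, and only the RT_TABLE_* values are taken
from the kernel's rtnetlink UAPI header. The first helper shows the
fallback to RT_TABLE_LOCAL for interfaces without a configured table; the
second mirrors the choice of table for the routes that are installed
automatically when an address is added.

```c
#include <stdint.h>

/* Values as defined in include/uapi/linux/rtnetlink.h. */
enum { RT_TABLE_UNSPEC = 0, RT_TABLE_MAIN = 254, RT_TABLE_LOCAL = 255 };

/* Hypothetical stand-in for the per-interface IPv4 state. */
struct demo_in_device {
	uint32_t rt_table_id;	/* RT_TABLE_UNSPEC until configured */
};

/* Hypothetical route kinds: connected subnet route vs. local address. */
enum demo_route_type { DEMO_RTN_UNICAST, DEMO_RTN_LOCAL };

/* Table used for input and ARP lookups arriving on this interface. */
static uint32_t demo_lookup_table(const struct demo_in_device *idev)
{
	return idev->rt_table_id != RT_TABLE_UNSPEC ? idev->rt_table_id
						    : RT_TABLE_LOCAL;
}

/*
 * Table that receives the routes created for a new address: an explicit
 * per-address table wins for every route type, otherwise connected
 * routes go to "main" and local routes go to "local".
 */
static uint32_t demo_addr_route_table(uint32_t ifa_rt_table,
				      enum demo_route_type type)
{
	if (ifa_rt_table != RT_TABLE_UNSPEC)
		return ifa_rt_table;
	return type == DEMO_RTN_UNICAST ? RT_TABLE_MAIN : RT_TABLE_LOCAL;
}
```

With rt_table_id left at RT_TABLE_UNSPEC the behaviour is unchanged from
today; setting it to 100 redirects lookups for that interface only.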

Comments

Hannes Frederic Sowa June 8, 2015, 8 p.m. UTC | #1
The iproute2 patch is currently here:

https://github.com/hannes/iproute2/commits/vrf

Bye,
Hannes


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shrijeet Mukherjee June 8, 2015, 8:22 p.m. UTC | #2
On Mon, Jun 8, 2015 at 12:58 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hi Shrijeet,
>
> From an architectural level I think the output path looks good. For the
> input path I would also to propose my (I think) more flexible solution:
>
> For rx layer I want to also propose my try:
>
> [PATCH net-next RFC] net: ipv4: arp: strong end system model semantics by per-interface local table override
>
> By allowing to direct routing table lookups to a specific table based
> on the incoming interface for IPv4 and ARP, we start to behave like a
> strong end host system without tweaking arp_* sysctl settings.
>
> The main motivation behind this patch was input and forwarding support
> in a VRF like model. Maybe it also helps for hardware offloading by
> allowing reducing rule complexity.
>
> An example:
>
> $ ip rule flush
> $ ip rule del
> $ ip rule del
> $ ip rule add inherit-table
> 0:      from all inherit-table
>
> This by default still uses RT_TABLE_LOCAL until we set up per interface
> route tables:
>
> $ ip link set dev enp0s25 ipv4-rt-table-id 100
> $ ip -d link ls dev enp0s25
> 2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
>     link/ether e4:7f:b2:1b:4c:61 brd ff:ff:ff:ff:ff:ff promiscuity 0 ipv4-rt-table-id 100 addrgenmode none
>
> This let's incoming and arp requests use routing table 100. The system
> will stop responding to arp requests as we don't have any entries in
> this routing table.


I like this model in general, as it addresses the issue that I have
not addressed around connected routes.

This would force local and directly connected host routes to be learnt
into the correct table.

It does raise a question, though:

1. The driver already knows the vrf-device-to-table map.
2. The device would now also know the final device-to-table map.

If so, do we still need to use fib_rules, or can we just look up the
table directly? It does make the configuration a little longer, since
each component device now needs configuration when you add/del a member
from a vrf.

If people generally agree and we want to skip the fib_rule lookup,
then I can make it such that enslaving already takes the dev-table id
as well, and then the process of enslaving in the nominal VRF case
becomes

ip link add vrf-dev type vrf table foo ipv4-rt-table-id bar
ip link set eth2 master vrf-dev

Does that work?
--
Hannes Frederic Sowa June 8, 2015, 8:33 p.m. UTC | #3
On Mon, Jun 8, 2015, at 22:22, Shrijeet Mukherjee wrote:
> On Mon, Jun 8, 2015 at 12:58 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > Hi Shrijeet,
> >
> > From an architectural level I think the output path looks good. For the
> > input path I would also to propose my (I think) more flexible solution:
> >
> > For rx layer I want to also propose my try:
> >
> > [PATCH net-next RFC] net: ipv4: arp: strong end system model semantics by per-interface local table override
> >
> > By allowing to direct routing table lookups to a specific table based
> > on the incoming interface for IPv4 and ARP, we start to behave like a
> > strong end host system without tweaking arp_* sysctl settings.
> >
> > The main motivation behind this patch was input and forwarding support
> > in a VRF like model. Maybe it also helps for hardware offloading by
> > allowing reducing rule complexity.
> >
> > An example:
> >
> > $ ip rule flush
> > $ ip rule del
> > $ ip rule del
> > $ ip rule add inherit-table
> > 0:      from all inherit-table
> >
> > This by default still uses RT_TABLE_LOCAL until we set up per interface
> > route tables:
> >
> > $ ip link set dev enp0s25 ipv4-rt-table-id 100
> > $ ip -d link ls dev enp0s25
> > 2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
> >     link/ether e4:7f:b2:1b:4c:61 brd ff:ff:ff:ff:ff:ff promiscuity 0 ipv4-rt-table-id 100 addrgenmode none
> >
> > This let's incoming and arp requests use routing table 100. The system
> > will stop responding to arp requests as we don't have any entries in
> > this routing table.
> 
> 
> I like this model in general, as it addresses the issue that I have
> not addressed around connected routes.
> 
> This would force local and directly connected host routes to be learnt
> into the correct table.
> 
> It does bring the question up then.
> 
> 1. The driver already knows the vrf device to table map
> 2. If the device also knows the final device to table map
> 
> then do we need to use fib_rules and just lookup the table directly.
> It does make the configuration a little longer since each component
> device now needs configuration when you add/del a member from a vrf.

This model is usable on its own, especially if one does not need routing
daemons or user space software dealing with VRFs and sending out packets.

> If people generally agree and we want to skip the fib_rule lookup,
> then I can make it such that enslaving already takes the dev-table id
> as well, and then the process of enslaving in the nominal VRF case
> becomes
> 
> ip link add vrf-dev type vrf table foo ipv4-rt-table-id bar
> ip link set eth2 master vrf-dev

I think this would be great.

Last time I looked into the patches it was not yet clear if we can do
that without holding strong references to the other interfaces. Hopefully
this can be done by just passing down the table ids to the slaves during
initialization and teardown of the master vrf interface.

Bye,
Hannes
--
David Miller June 8, 2015, 10:05 p.m. UTC | #4
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Mon, 08 Jun 2015 21:58:37 +0200

> +static inline u32 ipv4_idev_rt_table(const struct net_device *dev)
> +{
> +       u32 table_id;
> +
> +       rcu_read_lock();
> +       table_id = __in_dev_get_rcu(dev)->rt_table_id;
> +       rcu_read_unlock();
> +
> +       return table_id != RT_TABLE_UNSPEC ? table_id : RT_TABLE_LOCAL;
> +}

It's a real shame you have to do all of this RCU locking and inetdev
deref, because in more than half of the call sites the idev is already
available.
--
Hannes Frederic Sowa June 8, 2015, 10:13 p.m. UTC | #5
On Tue, Jun 9, 2015, at 00:05, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Mon, 08 Jun 2015 21:58:37 +0200
> 
> > +static inline u32 ipv4_idev_rt_table(const struct net_device *dev)
> > +{
> > +       u32 table_id;
> > +
> > +       rcu_read_lock();
> > +       table_id = __in_dev_get_rcu(dev)->rt_table_id;
> > +       rcu_read_unlock();
> > +
> > +       return table_id != RT_TABLE_UNSPEC ? table_id : RT_TABLE_LOCAL;
> > +}
> 
> It's a real shame you have to do all of this RCU locking and inetdev
> deref, because in more than half of the call sites the idev is already
> available.

I agree, I was not happy with that either.

It is easy to move the rt_table_id to net_device and use the same one
for IPv6. This would force people to build symmetric routing
configurations. I was striving for maximum flexibility first but I don't
really think this matters here.

Bye,
Hannes
--
David Miller June 8, 2015, 10:21 p.m. UTC | #6
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Tue, 09 Jun 2015 00:13:08 +0200

> It is easy to move the rt_table_id to net_device and use the same
> one for IPv6.  This would force people to build symmetric routing
> configurations. I was striving for maximum flexibility first but I
> don't really think this matters here.

Alternatively you could have __ipv4_idev_rt_table(idev) and implement
ipv4_idev_rt_table() in terms of that.
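
A minimal user-space sketch of that split, under stated assumptions: the
demo_* types and the no-op lock functions are hypothetical stand-ins for
struct net_device, struct in_device and rcu_read_lock()/rcu_read_unlock().
demo_idev_rt_table() plays the role of the suggested
__ipv4_idev_rt_table(idev), and demo_dev_rt_table() the role of
ipv4_idev_rt_table(dev). The NULL check is an addition of this sketch and
is not present in the posted helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Values as defined in include/uapi/linux/rtnetlink.h. */
enum { RT_TABLE_UNSPEC = 0, RT_TABLE_LOCAL = 255 };

/* Hypothetical stand-ins for struct in_device and struct net_device. */
struct demo_in_device {
	uint32_t rt_table_id;
};

struct demo_net_device {
	struct demo_in_device *ip_ptr;	/* may be NULL */
};

/* No-op placeholders for the RCU read-side primitives. */
static void demo_rcu_read_lock(void) { }
static void demo_rcu_read_unlock(void) { }

/* For call sites that already hold a valid idev: no locking, no deref. */
static uint32_t demo_idev_rt_table(const struct demo_in_device *idev)
{
	return idev->rt_table_id != RT_TABLE_UNSPEC ? idev->rt_table_id
						    : RT_TABLE_LOCAL;
}

/* For call sites that only have the device: wraps the helper above. */
static uint32_t demo_dev_rt_table(const struct demo_net_device *dev)
{
	uint32_t table_id = RT_TABLE_LOCAL;

	demo_rcu_read_lock();
	if (dev->ip_ptr)	/* assumption: tolerate a missing idev */
		table_id = demo_idev_rt_table(dev->ip_ptr);
	demo_rcu_read_unlock();

	return table_id;
}
```

Call sites that already have the idev at hand would use the first helper
and skip the lock and the extra dereference entirely.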
--
Shrijeet Mukherjee June 8, 2015, 10:44 p.m. UTC | #7
On Mon, Jun 8, 2015 at 1:33 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On Mon, Jun 8, 2015, at 22:22, Shrijeet Mukherjee wrote:
>> On Mon, Jun 8, 2015 at 12:58 PM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>> > Hi Shrijeet,
>> >
>> > This let's incoming and arp requests use routing table 100. The system
>> > will stop responding to arp requests as we don't have any entries in
>> > this routing table.
>>
>>
>> I like this model in general, as it addresses the issue that I have
>> not addressed around connected routes.
>>
>> This would force local and directly connected host routes to be learnt
>> into the correct table.
>>
>> It does bring the question up then.
>>
>> 1. The driver already knows the vrf device to table map
>> 2. If the device also knows the final device to table map
>>
>> then do we need to use fib_rules and just lookup the table directly.
>> It does make the configuration a little longer since each component
>> device now needs configuration when you add/del a member from a vrf.
>
> This model is usable on its own, especially if one does not need routing
> daemons
> or user space software dealing with VRFs and sending out packets.
>
>> If people generally agree and we want to skip the fib_rule lookup,
>> then I can make it such that enslaving already takes the dev-table id
>> as well, and then the process of enslaving in the nominal VRF case
>> becomes
>>
>> ip link add vrf-dev type vrf table foo ipv4-rt-table-id bar
>> ip link set eth2 master vrf-dev
>
> I think this would be great.
>
> Last time I looked into the patches it was not yet clear if we can do
> that
> without holding strong references to the other interfaces. Hopefully
> this can
> be done by just passing down the table ids to the slaves during
> initializing
> and teardown of the master vrf interface.
>
> Bye,
> Hannes

We can do that, and the hooks are all available. But do we want to cut
out the fib_rules? That would close off the opportunity for someone to
insert a fib_rule that overrides the rule which directs to a VRF device.

Personally I don't have a strong opinion, but I want to make sure we
understand that choice.
--
David Ahern June 9, 2015, 1:03 a.m. UTC | #8
On 6/8/15 1:58 PM, Hannes Frederic Sowa wrote:
> For rx layer I want to also propose my try:
>
> [PATCH net-next RFC] net: ipv4: arp: strong end system model semantics by per-interface local table override
>

I applied only the first 2 patches from Shrijeet and then tried to apply
your patch; it doesn't apply. Way too many failures. What branch should
it apply to?


--
Hannes Frederic Sowa June 9, 2015, 5:35 a.m. UTC | #9
Hi,

On Tue, Jun 9, 2015, at 03:03, David Ahern wrote:
> On 6/8/15 1:58 PM, Hannes Frederic Sowa wrote:
> > For rx layer I want to also propose my try:
> >
> > [PATCH net-next RFC] net: ipv4: arp: strong end system model semantics by per-interface local table override
> >
> 
> I applied only the first 2 patches from Shrijeet and then tried to apply 
> your patch; it doesn't apply. Way too many failures. What branch should 
> it apply too?

The patch is currently stand-alone and should apply on top of net-next
commit 6da8253bdd3945b81377e4908d6d395a9956f8af
Author: Florian Fainelli <f.fainelli@gmail.com>
Date:   Mon Jun 8 11:05:20 2015 -0700

net: phy: bcm7xxx: update workaround to fix 100BaseT corner cases

Shrijeet and I will consolidate that soon.

Bye,
Hannes
--
Hannes Frederic Sowa June 9, 2015, 5:41 a.m. UTC | #10
On Tue, Jun 9, 2015, at 00:44, Shrijeet Mukherjee wrote:
> On Mon, Jun 8, 2015 at 1:33 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > On Mon, Jun 8, 2015, at 22:22, Shrijeet Mukherjee wrote:
> >> On Mon, Jun 8, 2015 at 12:58 PM, Hannes Frederic Sowa
> >> <hannes@stressinduktion.org> wrote:
> >> > Hi Shrijeet,
> >> >
> >> > This let's incoming and arp requests use routing table 100. The system
> >> > will stop responding to arp requests as we don't have any entries in
> >> > this routing table.
> >>
> >>
> >> I like this model in general, as it addresses the issue that I have
> >> not addressed around connected routes.
> >>
> >> This would force local and directly connected host routes to be learnt
> >> into the correct table.
> >>
> >> It does bring the question up then.
> >>
> >> 1. The driver already knows the vrf device to table map
> >> 2. If the device also knows the final device to table map
> >>
> >> then do we need to use fib_rules and just lookup the table directly.
> >> It does make the configuration a little longer since each component
> >> device now needs configuration when you add/del a member from a vrf.
> >
> > This model is usable on its own, especially if one does not need routing
> > daemons
> > or user space software dealing with VRFs and sending out packets.
> >
> >> If people generally agree and we want to skip the fib_rule lookup,
> >> then I can make it such that enslaving already takes the dev-table id
> >> as well, and then the process of enslaving in the nominal VRF case
> >> becomes
> >>
> >> ip link add vrf-dev type vrf table foo ipv4-rt-table-id bar
> >> ip link set eth2 master vrf-dev
> >
> > I think this would be great.
> >
> > Last time I looked into the patches it was not yet clear if we can do
> > that
> > without holding strong references to the other interfaces. Hopefully
> > this can
> > be done by just passing down the table ids to the slaves during
> > initializing
> > and teardown of the master vrf interface.
> >
> > Bye,
> > Hannes
> 
> We can do that, and the hooks are all available. But do we want to cut
> out the fib_rules ? this would close out the opportunity for someone
> to insert a fib_rule to override the rule which directs to a VRF
> device.
> 
> Personally don't have a strong opinion, but want to make sure we
> understand that choice.

Hmm, wouldn't that still work with the target I added in my patch?

The only problem I see is that people might build up rules which are not
symmetric, so that vrf behavior differs between the input and output
paths. One addition to ease this is to add an interface selector which
matches on both iif and oif. Also, we must still keep in mind that rules
are matched linearly by walking a list; with hundreds of vrfs a lookup
would thus first have to match against hundreds of ip rules.

Bye,
Hannes
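
The cost of the linear rule walk can be illustrated with a small
user-space sketch. It assumes a simplified model, not the kernel's
fib_rules code: demo_rule, demo_fill_rules() and demo_rule_lookup() are
hypothetical names, and each rule matches only on an incoming interface
index. The lookup reports how many rules it had to examine before it
found a match.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule: "packets from interface iif use this table". */
struct demo_rule {
	int iif;
	uint32_t table;
};

/* Create one rule per interface: ifindex i+1 maps to table 1000+i. */
static void demo_fill_rules(struct demo_rule *rules, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		rules[i].iif = (int)(i + 1);
		rules[i].table = 1000 + (uint32_t)i;
	}
}

/*
 * Walk the rules in order and return the table of the first match, or 0
 * if nothing matches. *examined is set to the number of rules visited.
 */
static uint32_t demo_rule_lookup(const struct demo_rule *rules, size_t n,
				 int iif, size_t *examined)
{
	*examined = 0;
	for (size_t i = 0; i < n; i++) {
		*examined = i + 1;
		if (rules[i].iif == iif)
			return rules[i].table;
	}
	return 0;
}
```

In this model, with one rule per vrf, a packet for the last of 300 vrfs
visits all 300 rules, whereas a table id stored on the receiving
interface is read in a single step regardless of the number of vrfs.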
--

Patch

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index 0a21fbe..ed68f8e 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -25,19 +25,20 @@  struct in_device {
        atomic_t                refcnt;
        int                     dead;
        struct in_ifaddr        *ifa_list;      /* IP ifaddr chain              */
+       u32                     rt_table_id;
 
        struct ip_mc_list __rcu *mc_list;       /* IP multicast filter chain    */
        struct ip_mc_list __rcu * __rcu *mc_hash;
 
        int                     mc_count;       /* Number of installed mcasts   */
+       unsigned char           mr_qrv;
+       unsigned char           mr_gq_running;
+       unsigned char           mr_ifc_count;
        spinlock_t              mc_tomb_lock;
        struct ip_mc_list       *mc_tomb;
        unsigned long           mr_v1_seen;
        unsigned long           mr_v2_seen;
        unsigned long           mr_maxdelay;
-       unsigned char           mr_qrv;
-       unsigned char           mr_gq_running;
-       unsigned char           mr_ifc_count;
        struct timer_list       mr_gq_timer;    /* general query timer */
        struct timer_list       mr_ifc_timer;   /* interface change timer */
 
@@ -145,6 +146,7 @@  struct in_ifaddr {
        __u32                   ifa_preferred_lft;
        unsigned long           ifa_cstamp; /* created timestamp */
        unsigned long           ifa_tstamp; /* updated timestamp */
+       __u32                   ifa_rt_table; /* subnet route table */
 };
 
 int register_inetaddr_notifier(struct notifier_block *nb);
@@ -237,6 +239,17 @@  static inline void in_dev_put(struct in_device *idev)
 #define __in_dev_put(idev)  atomic_dec(&(idev)->refcnt)
 #define in_dev_hold(idev)   atomic_inc(&(idev)->refcnt)
 
+static inline u32 ipv4_idev_rt_table(const struct net_device *dev)
+{
+       u32 table_id;
+
+       rcu_read_lock();
+       table_id = __in_dev_get_rcu(dev)->rt_table_id;
+       rcu_read_unlock();
+
+       return table_id != RT_TABLE_UNSPEC ? table_id : RT_TABLE_LOCAL;
+}
+
 #endif /* __KERNEL__ */
 
 static __inline__ __be32 inet_make_mask(int logmask)
diff --git a/include/net/flow.h b/include/net/flow.h
index 8109a15..635e028 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -70,6 +70,8 @@  struct flowi4 {
        /* (saddr,daddr) must be grouped, same order as in IP header */
        __be32                  saddr;
        __be32                  daddr;
+       __u32                   rt_table_id;
+
 
        union flowi_uli         uli;
 #define fl4_sport              uli.ports.sport
diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h
index 2b82d7e..da7c79a 100644
--- a/include/uapi/linux/fib_rules.h
+++ b/include/uapi/linux/fib_rules.h
@@ -64,6 +64,7 @@  enum {
        FR_ACT_BLACKHOLE,       /* Drop without notification */
        FR_ACT_UNREACHABLE,     /* Drop with ENETUNREACH */
        FR_ACT_PROHIBIT,        /* Drop with EACCES */
+       FR_ACT_TO_TBL_INHERIT_DEV,
        __FR_ACT_MAX,
 };
 
diff --git a/include/uapi/linux/if_addr.h b/include/uapi/linux/if_addr.h
index 4318ab1..af89016 100644
--- a/include/uapi/linux/if_addr.h
+++ b/include/uapi/linux/if_addr.h
@@ -32,6 +32,7 @@  enum {
        IFA_CACHEINFO,
        IFA_MULTICAST,
        IFA_FLAGS,
+       IFA_RT_TABLE,
        __IFA_MAX,
 };
 
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 1737b7a..7f4cdb2 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -163,6 +163,7 @@  enum {
 enum {
        IFLA_INET_UNSPEC,
        IFLA_INET_CONF,
+       IFLA_INET_RT_TABLE,
        __IFLA_INET_MAX,
 };
 
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 9a12668..2728873 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -556,11 +556,17 @@  static int fib_nl_fill_rule(struct sk_buff *skb, struct fib_rule *rule,
 
        frh = nlmsg_data(nlh);
        frh->family = ops->family;
-       frh->table = rule->table;
-       if (nla_put_u32(skb, FRA_TABLE, rule->table))
-               goto nla_put_failure;
+
+       /* table id is not valid if we inherit from interface */
+       if (rule->action != FR_ACT_TO_TBL_INHERIT_DEV) {
+               frh->table = rule->table;
+               if (nla_put_u32(skb, FRA_TABLE, rule->table))
+                       goto nla_put_failure;
+       }
+
        if (nla_put_u32(skb, FRA_SUPPRESS_PREFIXLEN, rule->suppress_prefixlen))
                goto nla_put_failure;
+
        frh->res1 = 0;
        frh->res2 = 0;
        frh->action = rule->action;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 419d23c..91f074d 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -100,6 +100,7 @@  static const struct nla_policy ifa_ipv4_policy[IFA_MAX+1] = {
        [IFA_LABEL]             = { .type = NLA_STRING, .len = IFNAMSIZ - 1 },
        [IFA_CACHEINFO]         = { .len = sizeof(struct ifa_cacheinfo) },
        [IFA_FLAGS]             = { .type = NLA_U32 },
+       [IFA_RT_TABLE]          = { .type = NLA_U32 },
 };
 
 #define IN4_ADDR_HSIZE_SHIFT   8
@@ -244,6 +245,7 @@  static struct in_device *inetdev_init(struct net_device *dev)
                        sizeof(in_dev->cnf));
        in_dev->cnf.sysctl = NULL;
        in_dev->dev = dev;
+       in_dev->rt_table_id = RT_TABLE_UNSPEC;
        in_dev->arp_parms = neigh_parms_alloc(dev, &arp_tbl);
        if (!in_dev->arp_parms)
                goto out_kfree;
@@ -783,6 +785,11 @@  static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh,
        if (!tb[IFA_ADDRESS])
                tb[IFA_ADDRESS] = tb[IFA_LOCAL];
 
+       if (tb[IFA_RT_TABLE])
+               ifa->ifa_rt_table = nla_get_u32(tb[IFA_RT_TABLE]);
+       else
+               ifa->ifa_rt_table = RT_TABLE_UNSPEC;
+
        INIT_HLIST_NODE(&ifa->hash);
        ifa->ifa_prefixlen = ifm->ifa_prefixlen;
        ifa->ifa_mask = inet_make_mask(ifm->ifa_prefixlen);
@@ -1549,6 +1556,7 @@  static int inet_fill_ifaddr(struct sk_buff *skb, struct in_ifaddr *ifa,
            (ifa->ifa_label[0] &&
             nla_put_string(skb, IFA_LABEL, ifa->ifa_label)) ||
            nla_put_u32(skb, IFA_FLAGS, ifa->ifa_flags) ||
+           nla_put_u32(skb, IFA_RT_TABLE, ifa->ifa_rt_table) ||
            put_cacheinfo(skb, ifa->ifa_cstamp, ifa->ifa_tstamp,
                          preferred, valid))
                goto nla_put_failure;
@@ -1652,7 +1660,8 @@  static size_t inet_get_link_af_size(const struct net_device *dev)
        if (!in_dev)
                return 0;
 
-       return nla_total_size(IPV4_DEVCONF_MAX * 4); /* IFLA_INET_CONF */
+       return nla_total_size(IPV4_DEVCONF_MAX * 4) +   /* IFLA_INET_CONF */
+              nla_total_size(sizeof(u32));             /* IFLA_INET_RT_TABLE */
 }
 
 static int inet_fill_link_af(struct sk_buff *skb, const struct net_device *dev)
@@ -1664,6 +1673,9 @@  static int inet_fill_link_af(struct sk_buff *skb, const struct net_device *dev)
        if (!in_dev)
                return -ENODATA;
 
+       if (nla_put_u32(skb, IFLA_INET_RT_TABLE, in_dev->rt_table_id) < 0)
+               return -EMSGSIZE;
+
        nla = nla_reserve(skb, IFLA_INET_CONF, IPV4_DEVCONF_MAX * 4);
        if (!nla)
                return -EMSGSIZE;
@@ -1676,6 +1688,7 @@  static int inet_fill_link_af(struct sk_buff *skb, const struct net_device *dev)
 
 static const struct nla_policy inet_af_policy[IFLA_INET_MAX+1] = {
        [IFLA_INET_CONF]        = { .type = NLA_NESTED },
+       [IFLA_INET_RT_TABLE]    = { .type = NLA_U32 },
 };
 
 static int inet_validate_link_af(const struct net_device *dev,
@@ -1723,6 +1736,9 @@  static int inet_set_link_af(struct net_device *dev, const struct nlattr *nla)
                        ipv4_devconf_set(in_dev, nla_type(a), nla_get_u32(a));
        }
 
+       if (tb[IFLA_INET_RT_TABLE])
+               in_dev->rt_table_id = nla_get_u32(tb[IFLA_INET_RT_TABLE]);
+
        return 0;
 }
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 872494e..56b2656 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -225,7 +225,7 @@  static inline unsigned int __inet_dev_addr_type(struct net *net,
 
        rcu_read_lock();
 
-       local_table = fib_get_table(net, RT_TABLE_LOCAL);
+       local_table = fib_get_table(net, dev ? ipv4_idev_rt_table(dev) : RT_TABLE_LOCAL);
        if (local_table) {
                ret = RTN_UNICAST;
                if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
@@ -277,6 +277,7 @@  __be32 fib_compute_spec_dst(struct sk_buff *skb)
                fl4.flowi4_iif = LOOPBACK_IFINDEX;
                fl4.daddr = ip_hdr(skb)->saddr;
                fl4.saddr = 0;
+               fl4.rt_table_id = ipv4_idev_rt_table(dev);
                fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
                fl4.flowi4_scope = scope;
                fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
@@ -311,6 +312,7 @@  static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
        fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
        fl4.daddr = src;
        fl4.saddr = dst;
+       fl4.rt_table_id = ipv4_idev_rt_table(dev);
        fl4.flowi4_tos = tos;
        fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
 
@@ -774,7 +776,12 @@  static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad
                },
        };
 
-       if (type == RTN_UNICAST)
+       /* If the address carries an explicit table (ifa_rt_table is not
+        * RT_TABLE_UNSPEC), use that table for all types of routes.
+        */
+       if (ifa->ifa_rt_table != RT_TABLE_UNSPEC)
+               tb = fib_new_table(net, ifa->ifa_rt_table);
+       else if (type == RTN_UNICAST)
                tb = fib_new_table(net, RT_TABLE_MAIN);
        else
                tb = fib_new_table(net, RT_TABLE_LOCAL);
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 5615198..acb415c 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -75,9 +75,14 @@  static int fib4_rule_action(struct fib_rule *rule, struct flowi *flp,
 {
        int err = -EAGAIN;
        struct fib_table *tbl;
+       u32 table;
 
        switch (rule->action) {
+       case FR_ACT_TO_TBL_INHERIT_DEV:
+               table = flp->u.ip4.rt_table_id;
+               break;
        case FR_ACT_TO_TBL:
+               table = rule->table;
                break;
 
        case FR_ACT_UNREACHABLE:
@@ -93,7 +98,7 @@  static int fib4_rule_action(struct fib_rule *rule, struct flowi *flp,
 
        rcu_read_lock();
 
-       tbl = fib_get_table(rule->fr_net, rule->table);
+       tbl = fib_get_table(rule->fr_net, table);
        if (tbl)
                err = fib_table_lookup(tbl, &flp->u.ip4,
                                       (struct fib_result *)arg->result,
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 28ec3c1..afb0011 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -587,7 +587,7 @@  static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 {
        int err;
        struct net *net;
-       struct net_device *dev;
+       struct net_device *dev = NULL;
 
        net = cfg->fc_nlinfo.nl_net;
        if (nh->nh_gw) {
@@ -616,6 +616,8 @@  static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
                                .flowi4_scope = cfg->fc_scope + 1,
                                .flowi4_oif = nh->nh_oif,
                                .flowi4_iif = LOOPBACK_IFINDEX,
+                               .rt_table_id = dev ? ipv4_idev_rt_table(dev)
+                                              : RT_TABLE_LOCAL,
                        };
 
                        /* It is not necessary, but requires a bit of thinking */
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f5203fb..36952c8 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -425,6 +425,7 @@  static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
        fl4.flowi4_mark = mark;
        fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
        fl4.flowi4_proto = IPPROTO_ICMP;
+       fl4.rt_table_id = ipv4_idev_rt_table(skb->dev);
        security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
        rt = ip_route_output_key(net, &fl4);
        if (IS_ERR(rt))
diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c
index 4bfaedf..c7c1407 100644
--- a/net/ipv4/netfilter/ipt_rpfilter.c
+++ b/net/ipv4/netfilter/ipt_rpfilter.c
@@ -93,6 +93,7 @@  static bool rpfilter_mt(const struct sk_buff *skb, struct xt_action_param *par)
        flow.flowi4_iif = LOOPBACK_IFINDEX;
        flow.daddr = iph->saddr;
        flow.saddr = rpfilter_get_saddr(iph->daddr);
+       flow.rt_table_id = ipv4_idev_rt_table(skb->dev);
        flow.flowi4_oif = 0;
        flow.flowi4_mark = info->flags & XT_RPFILTER_VALID_MARK ? skb->mark : 0;
        flow.flowi4_tos = RT_TOS(iph->tos);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f605598..eec1908 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1716,6 +1716,7 @@  static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
        fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
        fl4.daddr = daddr;
        fl4.saddr = saddr;
+       fl4.rt_table_id = ipv4_idev_rt_table(dev);
        err = fib_lookup(net, &fl4, &res);
        if (err != 0) {
                if (!IN_DEV_FORWARD(in_dev))