diff mbox

[2/2] ipv4: Change rt->rt_iif encoding.

Message ID 20120723.140737.822125500680433996.davem@davemloft.net
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

David Miller July 23, 2012, 9:07 p.m. UTC
On input packet processing, rt->rt_iif will be zero if we should
use skb->dev->ifindex.

Since we access rt->rt_iif consistently via inet_iif(), that is
the only spot whose interpretation have to adjust.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/route.h |    6 +++++-
 net/ipv4/route.c    |    8 ++++----
 2 files changed, 9 insertions(+), 5 deletions(-)

Comments

David Miller July 23, 2012, 10:22 p.m. UTC | #1
From: David Miller <davem@davemloft.net>
Date: Mon, 23 Jul 2012 14:07:37 -0700 (PDT)

> On input packet processing, rt->rt_iif will be zero if we should
> use skb->dev->ifindex.
> 
> Since we access rt->rt_iif consistently via inet_iif(), that is
> the only spot whose interpretation have to adjust.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

Julian, as you seem to have feared, this turns out to not work.

We zap the skb->dev apparently, so I'll need to come up with
a different scheme to handle this.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller July 23, 2012, 11:05 p.m. UTC | #2
From: Julian Anastasov <ja@ssi.bg>
Date: Tue, 24 Jul 2012 02:11:38 +0300 (EEST)

> skb_iif: ifindex of device we arrived on
> 
> 	If it has hidden semantic may be we can make this
> comment more specific, does it survive some loops?

Strangely I hadn't noticed this, and after the email I just sent
out I was going to look into adding such a value :-)

Great!

I'll respin my patches to use that thing, thanks Julian.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Julian Anastasov July 23, 2012, 11:11 p.m. UTC | #3
Hello,

On Mon, 23 Jul 2012, David Miller wrote:

> From: David Miller <davem@davemloft.net>
> Date: Mon, 23 Jul 2012 14:07:37 -0700 (PDT)
> 
> > On input packet processing, rt->rt_iif will be zero if we should
> > use skb->dev->ifindex.
> > 
> > Since we access rt->rt_iif consistently via inet_iif(), that is
> > the only spot whose interpretation have to adjust.
> > 
> > Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> Julian, as you seem to have feared, this turns out to not work.
> 
> We zap the skb->dev apparently, so I'll need to come up with
> a different scheme to handle this.

	I was just replying that I don't see any problems
with the 4 patches but when you say it, may be the net/sched
code will work with output device on forwarding, not input.
Is skb_iif a good source? From include/linux/skbuff.h:

skb_iif: ifindex of device we arrived on

	If it has hidden semantic may be we can make this
comment more specific, does it survive some loops?

	Anyways, the idea for new encoding of rt_iif is very good.

Regards

--
Julian Anastasov <ja@ssi.bg>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller July 23, 2012, 11:14 p.m. UTC | #4
From: David Miller <davem@davemloft.net>
Date: Mon, 23 Jul 2012 16:05:41 -0700 (PDT)

> From: Julian Anastasov <ja@ssi.bg>
> Date: Tue, 24 Jul 2012 02:11:38 +0300 (EEST)
> 
>> skb_iif: ifindex of device we arrived on
>> 
>> 	If it has hidden semantic may be we can make this
>> comment more specific, does it survive some loops?
> 
> Strangely I hadn't noticed this, and after the email I just sent
> out I was going to look into adding such a value :-)
> 
> Great!
> 
> I'll respin my patches to use that thing, thanks Julian.

Hmmm, the problem is that when we decapsulate VLAN devices, we're left
with the parent device's index in skb->skb_iif.

That's what all of that "orig_dev" logic in __netif_receive_skb() is
all about.  The skb->dev is rewritten by vlan_do_receive() for that
case.

I wonder if we should just get rid of all of that orig_dev logic and
simply update skb->skb_iif every time we hit the code starting at
label "another_round"

I'll keep looking into this.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Julian Anastasov July 24, 2012, 12:24 a.m. UTC | #5
Hello,

On Mon, 23 Jul 2012, David Miller wrote:

> Hmmm, the problem is that when we decapsulate VLAN devices, we're left
> with the parent device's index in skb->skb_iif.

	Not sure if it is a problem with VLANs, can be also
with some virtual devices but may be they use dev_forward_skb()
where skb_iif is zeroed.

> That's what all of that "orig_dev" logic in __netif_receive_skb() is
> all about.  The skb->dev is rewritten by vlan_do_receive() for that
> case.

	This is the part that I didn't know for skb_iif. I was
also worrying about ip_mc_output looping packets with skb_iif
because skb_clone copies the field but may be such loops
happen only for locally originated traffic where skb_iif starts
with 0.

	Still, the comment in ipmr_queue_xmit() and the
IPSKB_FORWARDED bit puzzles me:

         * RFC1584 teaches, that DVMRP/PIM router must deliver packets locally
         * not only before forwarding, but after forwarding on all output
         * interfaces. It is clear, if mrouter runs a multicasting

	ip_mr_forward preserves one copy to local app,
calls ipmr_queue_xmit() where packet can be looped?

	ip_mr_input():

        /* Packet is looped back after forward, it should not be
         * forwarded second time, but still can be delivered locally.
         */

	I really hope the case is for locally originated
multicast, not for packets received from network.

> I wonder if we should just get rid of all of that orig_dev logic and
> simply update skb->skb_iif every time we hit the code starting at
> label "another_round"

	I don't have opinion about this problem. I see your
new change for __netif_receive_skb, the skb_iif places are
something that I have to check for me...

Regards

--
Julian Anastasov <ja@ssi.bg>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller July 24, 2012, 12:43 a.m. UTC | #6
From: Julian Anastasov <ja@ssi.bg>
Date: Tue, 24 Jul 2012 03:24:38 +0300 (EEST)

> On Mon, 23 Jul 2012, David Miller wrote:
> 
>> Hmmm, the problem is that when we decapsulate VLAN devices, we're left
>> with the parent device's index in skb->skb_iif.
> 
> 	Not sure if it is a problem with VLANs, can be also
> with some virtual devices but may be they use dev_forward_skb()
> where skb_iif is zeroed.

dev_forward_skb() gives the packet to netif_rx() which will thus send
the packet down to __netif_receive_skb() which will set the
skb->skb_iif to the new device's ifindex.  It will not stay at zero
:-)

> 	I was also worrying about ip_mc_output looping packets with
> skb_iif because skb_clone copies the field but may be such loops
> happen only for locally originated traffic where skb_iif starts with
> 0.

Loopback of multicast packets is done in ip_mc_output(), via
the clone that you mention, via dev_loopback_xmit().

dev_loopback_xmit() gives the packet to netif_rx_ni() which again
lands it back at __netif_receive_skb(), which (with my changes)
will adjust the skb->skb_iif to match whatever sits in skb->dev
at the time.

It should be alright.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Julian Anastasov July 24, 2012, 7:41 a.m. UTC | #7
Hello,

On Mon, 23 Jul 2012, David Miller wrote:

> From: Julian Anastasov <ja@ssi.bg>
> Date: Tue, 24 Jul 2012 03:24:38 +0300 (EEST)
> 
> > On Mon, 23 Jul 2012, David Miller wrote:
> > 
> >> Hmmm, the problem is that when we decapsulate VLAN devices, we're left
> >> with the parent device's index in skb->skb_iif.
> > 
> > 	Not sure if it is a problem with VLANs, can be also
> > with some virtual devices but may be they use dev_forward_skb()
> > where skb_iif is zeroed.
> 
> dev_forward_skb() gives the packet to netif_rx() which will thus send
> the packet down to __netif_receive_skb() which will set the
> skb->skb_iif to the new device's ifindex.  It will not stay at zero
> :-)
> 
> > 	I was also worrying about ip_mc_output looping packets with
> > skb_iif because skb_clone copies the field but may be such loops
> > happen only for locally originated traffic where skb_iif starts with
> > 0.
> 
> Loopback of multicast packets is done in ip_mc_output(), via
> the clone that you mention, via dev_loopback_xmit().
> 
> dev_loopback_xmit() gives the packet to netif_rx_ni() which again
> lands it back at __netif_receive_skb(), which (with my changes)
> will adjust the skb->skb_iif to match whatever sits in skb->dev
> at the time.
> 
> It should be alright.

	Yes, I know that now it is set in all cases, I was
wondering why __netif_receive_skb was prepared to see
non-zero value. Please also check xfrm4_fill_dst, the
possible case where the new skb_iif usage can differ from
rt_iif is for forwarded packets where skb->dst is changed
with new version, for example:

ip_forward -> *xfrm*_policy_check* -> __xfrm_decode_session ->
_decode_session4 memsets flowi4_iif (now to get skb_iif).

ip_forward -> xfrm4_route_forward -> ... xfrm_lookup ->
... -> xfrm4_fill_dst now will use skb_iif ?

	XFRM is still magic for me and I'm not sure if
we need some other device instead of skb_iif. Should we
prefer dst->dev for output routes? For example:

static inline int inet_iif(const struct sk_buff *skb)
{
	struct rtable *rt = skb_rtable(skb);
	int iif = rt->rt_iif;

	if (iif)
		return iif;
	if (rt_is_output_route(rt))
		return rt->dst.dev->ifindex;
	return skb->skb_iif;
}

	but may be it is same if we restore the
rth->rt_iif	= orig_oif ? : dev_out->ifindex;

	code in __mkroute_output? By this way we will have
correct rt_iif at the time of routing call, before the
packet is looped via this device. Then skb_iif will remain
as default only for input routes. By this way we will be
less worried for places like xfrm or when skb->dst is
replaced in forwarding path. IPIP tunnels also replace
skb->dst but this can hurt only code like route4_classify.

Regards

--
Julian Anastasov <ja@ssi.bg>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicolas de Pesloüan July 24, 2012, 10:13 p.m. UTC | #8
Le 24/07/2012 01:14, David Miller a écrit :
[...]
> I wonder if we should just get rid of all of that orig_dev logic and
> simply update skb->skb_iif every time we hit the code starting at
> label "another_round"

It clearly depends on the exact meaning of orig_dev.

When we studied the usage of orig_dev before removing it from bonding, it was clear that two 
different meanings existed:

- From the bonding point of view, is was "the device one level below current device" (the slave, 
from the master's point of view).

- From the af_packet point of view, is was "the real original device that received the packet".

As bonding don't use orig_dev anymore, the remaining meaning should logically be "the real original 
device that received the packet". But as __netif_receive_skb() is recursively called in many cases, 
setting orig_dev to something new every time, this meaning is probably mostly inconsistent. As such, 
it sounds appropriate to remove orig_dev and use skb_iif instead.

	Nicolas.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller July 24, 2012, 10:18 p.m. UTC | #9
From: Nicolas de Pesloüan <nicolas.2p.debian@gmail.com>
Date: Wed, 25 Jul 2012 00:13:55 +0200

> - From the af_packet point of view, is was "the real original device
> - that received the packet".
> 
> As bonding don't use orig_dev anymore, the remaining meaning should
> logically be "the real original device that received the packet". But
> as __netif_receive_skb() is recursively called in many cases, setting
> orig_dev to something new every time, this meaning is probably mostly
> inconsistent. As such, it sounds appropriate to remove orig_dev and
> use skb_iif instead.

I don't think we can, otherwise people who set po->origdev will no
longer get what they expect.

For the simpler cases of bonding and VLANs, it does currently behaved
as expected.

That's why I left it alone.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicolas de Pesloüan July 24, 2012, 10:25 p.m. UTC | #10
Le 25/07/2012 00:18, David Miller a écrit :
> From: Nicolas de Pesloüan<nicolas.2p.debian@gmail.com>
> Date: Wed, 25 Jul 2012 00:13:55 +0200
>
>> - From the af_packet point of view, is was "the real original device
>> - that received the packet".
>>
>> As bonding don't use orig_dev anymore, the remaining meaning should
>> logically be "the real original device that received the packet". But
>> as __netif_receive_skb() is recursively called in many cases, setting
>> orig_dev to something new every time, this meaning is probably mostly
>> inconsistent. As such, it sounds appropriate to remove orig_dev and
>> use skb_iif instead.
>
> I don't think we can, otherwise people who set po->origdev will no
> longer get what they expect.
>
> For the simpler cases of bonding and VLANs, it does currently behaved
> as expected.
>
> That's why I left it alone.

Do they get what they expect when stacking interfaces?

__netif_receive_skb starts with orig_dev = skb->dev. So when calling __netif_receive_skb 
recursively, after changing skb->dev, they get the packet several times, with a different orig_dev 
value?

Any way, both looks good to me.

	Nicolas.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/net/route.h b/include/net/route.h
index 60d611d..d418ae1 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -277,7 +277,11 @@  static inline struct rtable *ip_route_newports(struct flowi4 *fl4, struct rtable
 
 static inline int inet_iif(const struct sk_buff *skb)
 {
-	return skb_rtable(skb)->rt_iif;
+	int iif = skb_rtable(skb)->rt_iif;
+
+	if (iif)
+		return iif;
+	return skb->dev->ifindex;
 }
 
 extern int sysctl_ip_default_ttl;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f6be781..6bcb8fc 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1309,7 +1309,7 @@  static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	rth->rt_flags	= RTCF_MULTICAST;
 	rth->rt_type	= RTN_MULTICAST;
 	rth->rt_is_input= 1;
-	rth->rt_iif	= dev->ifindex;
+	rth->rt_iif	= 0;
 	rth->rt_pmtu	= 0;
 	rth->rt_gateway	= 0;
 	if (our) {
@@ -1435,7 +1435,7 @@  static int __mkroute_input(struct sk_buff *skb,
 	rth->rt_flags = flags;
 	rth->rt_type = res->type;
 	rth->rt_is_input = 1;
-	rth->rt_iif 	= in_dev->dev->ifindex;
+	rth->rt_iif 	= 0;
 	rth->rt_pmtu	= 0;
 	rth->rt_gateway	= 0;
 
@@ -1608,7 +1608,7 @@  local_input:
 	rth->rt_flags 	= flags|RTCF_LOCAL;
 	rth->rt_type	= res.type;
 	rth->rt_is_input = 1;
-	rth->rt_iif	= dev->ifindex;
+	rth->rt_iif	= 0;
 	rth->rt_pmtu	= 0;
 	rth->rt_gateway	= 0;
 	if (res.type == RTN_UNREACHABLE) {
@@ -1772,7 +1772,7 @@  static struct rtable *__mkroute_output(const struct fib_result *res,
 	rth->rt_flags	= flags;
 	rth->rt_type	= type;
 	rth->rt_is_input = 0;
-	rth->rt_iif	= orig_oif ? : dev_out->ifindex;
+	rth->rt_iif	= orig_oif ? : 0;
 	rth->rt_pmtu	= 0;
 	rth->rt_gateway = 0;