diff mbox series

net: fix routing encapsulated packets when binding a socket to a tunnel interface

Message ID 1553236198-5955-1-git-send-email-lifonghsu@synology.com
State Changes Requested
Delegated to: David Miller
Headers show
Series net: fix routing encapsulated packets when binding a socket to a tunnel interface | expand

Commit Message

lifonghsu March 22, 2019, 6:29 a.m. UTC
From: LiFong Hsu <lifonghsu@synology.com>

When binding a socket to a 4in6/6in4 tunnel interface, the kernel sends
the packet to the tunnel interface without any problem, e.g.,
ping 8.8.8.8 -I 4in6. However, after the 4in6/6in4 tunnel encapsulation,
the encapsulated packet could be sent to the tunnel interface again when
some fields of the skb were changed in mangle table's output chain,
such as skb->mark and src/dest IP address. Sending to the tunnel interface
twice is unexpected, since there are no corresponding routing rules on
the tunnel interface for the encapsulated packet. Eventually, the encapsulated
packet will be dropped by the tunnel interface.

This commit stops referring to sk_bound_dev_if while re-routing a packet
with skb_iif!=0 which indicates that the packet has already been sent to
the interface specified by sk_bound_dev_if. Instead, this commit sends
the packet to the underlying network device.

Signed-off-by: LiFong Hsu <lifonghsu@synology.com>
Reviewed-by: JianJhen Chen <kchen@synology.com>
---
 net/ipv4/netfilter.c  | 2 +-
 net/ipv6/ip6_tunnel.c | 3 +++
 net/ipv6/netfilter.c  | 2 +-
 net/ipv6/sit.c        | 4 ++++
 4 files changed, 9 insertions(+), 2 deletions(-)

Comments

David Miller March 25, 2019, 11:16 p.m. UTC | #1
From: lifonghsu <lifonghsu@synology.com>
Date: Fri, 22 Mar 2019 14:29:58 +0800

> From: LiFong Hsu <lifonghsu@synology.com>
> 
> When binding a socket to a 4in6/6in4 tunnel interface, the kernel sends
> the packet to the tunnel interface without any problem, e.g.,
> ping 8.8.8.8 -I 4in6. However, after the 4in6/6in4 tunnel encapsulation,
> the encapsulated packet could be sent to the tunnel interface again when
> some fields of the skb were changed in mangle table's output chain,
> such as skb->mark and src/dest IP address. Sending to the tunnel interface
> twice is unexpected, since there are no corresponding routing rules on
> the tunnel interface for the encapsulated packet. Eventually, the encapsulated
> packet will be dropped by the tunnel interface.
> 
> This commit stops referring to sk_bound_dev_if while re-routing a packet
> with skb_iif!=0 which indicates that the packet has already been sent to
> the interface specified by sk_bound_dev_if. Instead, this commit sends
> the packet to the underlying network device.
> 
> Signed-off-by: LiFong Hsu <lifonghsu@synology.com>
> Reviewed-by: JianJhen Chen <kchen@synology.com>

skb->skb_iif is a receive side indication, why would it be changed on
transmit?

I see mac802154 doing this, but what it's doing is somewhat broken and
that doesn't come into play in your example.
David Miller March 27, 2019, 8:44 p.m. UTC | #2
From: lifonghsu <lifonghsu@synology.com>
Date: Fri, 22 Mar 2019 14:29:58 +0800

> From: LiFong Hsu <lifonghsu@synology.com>
> 
> When binding a socket to a 4in6/6in4 tunnel interface, the kernel sends
> the packet to the tunnel interface without any problem, e.g.,
> ping 8.8.8.8 -I 4in6. However, after the 4in6/6in4 tunnel encapsulation,
> the encapsulated packet could be sent to the tunnel interface again when
> some fields of the skb were changed in mangle table's output chain,
> such as skb->mark and src/dest IP address. Sending to the tunnel interface
> twice is unexpected, since there are no corresponding routing rules on
> the tunnel interface for the encapsulated packet. Eventually, the encapsulated
> packet will be dropped by the tunnel interface.
> 
> This commit stops referring to sk_bound_dev_if while re-routing a packet
> with skb_iif!=0 which indicates that the packet has already been sent to
> the interface specified by sk_bound_dev_if. Instead, this commit sends
> the packet to the underlying network device.
> 
> Signed-off-by: LiFong Hsu <lifonghsu@synology.com>
> Reviewed-by: JianJhen Chen <kchen@synology.com>

If you do not respond to my feedback, I am going to drop your patch.

You have 24 hours to respond before that happens.
lifonghsu March 28, 2019, 8:30 a.m. UTC | #3
David Miller 於 2019-03-28 04:44 寫到:
> From: lifonghsu <lifonghsu@synology.com>
> Date: Fri, 22 Mar 2019 14:29:58 +0800
> 
>> From: LiFong Hsu <lifonghsu@synology.com>
>> 
>> When binding a socket to a 4in6/6in4 tunnel interface, the kernel 
>> sends
>> the packet to the tunnel interface without any problem, e.g.,
>> ping 8.8.8.8 -I 4in6. However, after the 4in6/6in4 tunnel 
>> encapsulation,
>> the encapsulated packet could be sent to the tunnel interface again 
>> when
>> some fields of the skb were changed in mangle table's output chain,
>> such as skb->mark and src/dest IP address. Sending to the tunnel 
>> interface
>> twice is unexpected, since there are no corresponding routing rules on
>> the tunnel interface for the encapsulated packet. Eventually, the 
>> encapsulated
>> packet will be dropped by the tunnel interface.
>> 
>> This commit stops referring to sk_bound_dev_if while re-routing a 
>> packet
>> with skb_iif!=0 which indicates that the packet has already been sent 
>> to
>> the interface specified by sk_bound_dev_if. Instead, this commit sends
>> the packet to the underlying network device.
>> 
>> Signed-off-by: LiFong Hsu <lifonghsu@synology.com>
>> Reviewed-by: JianJhen Chen <kchen@synology.com>
> 
> If you do not respond to my feedback, I am going to drop your patch.
> 
> You have 24 hours to respond before that happens.

Sorry for the late reply.

> skb->skb_iif is a receive side indication, why would it be changed on 
> transmit?
Indeed, skb_iif is used as receive site indication to present "device 
the packet arrived on".
This commit keeps the previous arrived device (similar to the concept of 
"device the packet arrived on") in skb_iif field to prevent kernel from 
referring sk_bound_dev_if again. Otherwise, we might need to add a new 
field to sk_buff structure for our purpose.

An example of how this commit works with 4in6 tunnel device is listed as 
follows:
socket bind 4in6 device (e.g. ping -I 4in6 8.8.8.8)
-> skb is transmitted to 4in6 device by referring sk_bound_dev_if
-> encapsulate the IPv4 packet to an IPv6 packet
-> "this commit sets skb_iif to 4in6 device as a flag"
-> ip6tunnel_xmit
-> re-route the encapsulated IPv6 packet
-> "this commit prevents kernel from referring sk_bound_dev_if again by 
checking skb_iif, so the encapsulated IPv6 packet will be sent to the 
underlying device instead of the 4in6 device"
-> underlying device

> I see mac802154 doing this, but what it's doing is somewhat broken and 
> that doesn't come into play in your example.

We have not tested mac802154, but we think this commit should be 
compatible with mac802154. Considering mac802154 is an underlying device 
in the above mentioned example. The skb_iif will be overwritten by 
mac802154 at the end. So the behavior of mac802154 should not be 
affected by this commit.

Thanks.
David Miller March 29, 2019, 5:40 p.m. UTC | #4
From: lifonghsu <lifonghsu@synology.com>
Date: Thu, 28 Mar 2019 16:30:37 +0800

> Indeed, skb_iif is used as receive site indication to present "device
> the packet arrived on".
> This commit keeps the previous arrived device (similar to the concept
> of "device the packet arrived on") in skb_iif field to prevent kernel
> from referring sk_bound_dev_if again. Otherwise, we might need to add
> a new field to sk_buff structure for our purpose.

Therefore, you are deciding to arbitrarily repurpose an RX side piece
of state for TX purposes.

Do not do this.

It confuses anyone trying to understand how skb_iif works.

You must use something with a different name, and clear semantics, to
achieve this goal.

For example, you could use an anonymous union:

	union {
		int	skb_iif;
		bool	bound_dev_already_applied;
	};

You never actually _USE_ the value of skb_iif, it is just merely a
boolean indicating whether sk_bound_dev_if was applied already.
lifonghsu April 1, 2019, 3 a.m. UTC | #5
David Miller 於 2019-03-30 01:40 寫到:
> From: lifonghsu <lifonghsu@synology.com>
> Date: Thu, 28 Mar 2019 16:30:37 +0800
> 
>> Indeed, skb_iif is used as receive site indication to present "device
>> the packet arrived on".
>> This commit keeps the previous arrived device (similar to the concept
>> of "device the packet arrived on") in skb_iif field to prevent kernel
>> from referring sk_bound_dev_if again. Otherwise, we might need to add
>> a new field to sk_buff structure for our purpose.
> 
> Therefore, you are deciding to arbitrarily repurpose an RX side piece
> of state for TX purposes.
> 
> Do not do this.
> 
> It confuses anyone trying to understand how skb_iif works.
> 
> You must use something with a different name, and clear semantics, to
> achieve this goal.
> 
> For example, you could use an anonymous union:
> 
> 	union {
> 		int	skb_iif;
> 		bool	bound_dev_already_applied;
> 	};
> 
> You never actually _USE_ the value of skb_iif, it is just merely a
> boolean indicating whether sk_bound_dev_if was applied already.

Since we are not sure whether there is any code referring to the value 
of
skb_iff in transmit site or not, it is risky to set an invalid interface
index to skb_iif in an anonymous union.
What if we add a bit (e.g., __u8 bound_dev_already_applied:1;) to 
sk_buff?
David Miller April 1, 2019, 6:01 a.m. UTC | #6
From: lifonghsu <lifonghsu@synology.com>
Date: Mon, 01 Apr 2019 11:00:22 +0800

> Since we are not sure whether there is any code referring to the
> value of skb_iff in transmit site or not, it is risky to set an
> invalid interface

If you are no confident in how the value is used on TX, then I can't
consider your change with confidence either.

Think about what you are saying to me here.  You basically do not know
whether you are adding a bug or not.
lifonghsu April 1, 2019, 9:52 a.m. UTC | #7
David Miller 於 2019-04-01 14:01 寫到:
> From: lifonghsu <lifonghsu@synology.com>
> Date: Mon, 01 Apr 2019 11:00:22 +0800
> 
>> Since we are not sure whether there is any code referring to the
>> value of skb_iff in transmit site or not, it is risky to set an
>> invalid interface
> 
> If you are no confident in how the value is used on TX, then I can't
> consider your change with confidence either.
> 
> Think about what you are saying to me here.  You basically do not know
> whether you are adding a bug or not.

Indeed, we are not confident about the commit. Please drop it for now.
Thank you for help review the commit.
diff mbox series

Patch

diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index 8d2e5dc9..426c56b 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -41,7 +41,7 @@  int ip_route_me_harder(struct net *net, struct sk_buff *skb, unsigned int addr_t
 	fl4.daddr = iph->daddr;
 	fl4.saddr = saddr;
 	fl4.flowi4_tos = RT_TOS(iph->tos);
-	fl4.flowi4_oif = sk ? sk->sk_bound_dev_if : 0;
+	fl4.flowi4_oif = (sk && !skb->skb_iif) ? sk->sk_bound_dev_if : 0;
 	if (!fl4.flowi4_oif)
 		fl4.flowi4_oif = l3mdev_master_ifindex(dev);
 	fl4.flowi4_mark = skb->mark;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 0c6403c..6997599 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1220,6 +1220,9 @@  int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
 	ipv6h->nexthdr = proto;
 	ipv6h->saddr = fl6->saddr;
 	ipv6h->daddr = fl6->daddr;
+
+	/* Reset the skb_iif to Tunnels interface index */
+	skb->skb_iif = dev->ifindex;
 	ip6tunnel_xmit(NULL, skb, dev);
 	return 0;
 tx_err_link_failure:
diff --git a/net/ipv6/netfilter.c b/net/ipv6/netfilter.c
index 6d0b1f3..55cc431 100644
--- a/net/ipv6/netfilter.c
+++ b/net/ipv6/netfilter.c
@@ -26,7 +26,7 @@  int ip6_route_me_harder(struct net *net, struct sk_buff *skb)
 	int strict = (ipv6_addr_type(&iph->daddr) &
 		      (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL));
 	struct flowi6 fl6 = {
-		.flowi6_oif = sk && sk->sk_bound_dev_if ? sk->sk_bound_dev_if :
+		.flowi6_oif = (sk && sk->sk_bound_dev_if && !skb->skb_iif) ? sk->sk_bound_dev_if :
 			strict ? skb_dst(skb)->dev->ifindex : 0,
 		.flowi6_mark = skb->mark,
 		.flowi6_uid = sock_net_uid(net, sk),
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index e8a1dab..1112ef2 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -988,6 +988,8 @@  static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 
 	skb_set_inner_ipproto(skb, IPPROTO_IPV6);
 
+	/* Reset the skb_iif to Tunnels interface index */
+	skb->skb_iif = dev->ifindex;
 	iptunnel_xmit(NULL, rt, skb, fl4.saddr, fl4.daddr, protocol, tos, ttl,
 		      df, !net_eq(tunnel->net, dev_net(dev)));
 	return NETDEV_TX_OK;
@@ -1011,6 +1013,8 @@  static netdev_tx_t sit_tunnel_xmit__(struct sk_buff *skb,
 
 	skb_set_inner_ipproto(skb, ipproto);
 
+	/* Reset the skb_iif to Tunnels interface index */
+	skb->skb_iif = dev->ifindex;
 	ip_tunnel_xmit(skb, dev, tiph, ipproto);
 	return NETDEV_TX_OK;
 tx_error: