
[net-next] xfrm: Call IP receive handler directly for inbound tunnel-mode packets

Message ID 1325475154-15997-1-git-send-email-david.ward@ll.mit.edu
State Rejected, archived
Delegated to: David Miller

Commit Message

David Ward Jan. 2, 2012, 3:32 a.m. UTC
For IPsec tunnel mode (or BEET mode), after inbound packets are xfrm'ed,
call the IPv4/IPv6 receive handler directly instead of calling netif_rx.
In addition to avoiding unneeded re-processing of the MAC layer, packets
will not be received a second time on network taps. (Note that outbound
packets are only received on network taps post-xfrm, but inbound packets
were being received both pre- and post-xfrm. So now network taps will
receive packets in either direction only once, in the form that they go
"over the wire".)

Signed-off-by: David Ward <david.ward@ll.mit.edu>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
---
 include/net/xfrm.h     |    3 +++
 net/ipv4/xfrm4_input.c |    5 +++++
 net/ipv4/xfrm4_state.c |    1 +
 net/ipv6/xfrm6_input.c |    5 +++++
 net/ipv6/xfrm6_state.c |    1 +
 net/xfrm/xfrm_input.c  |    4 +++-
 6 files changed, 18 insertions(+), 1 deletions(-)

Comments

Herbert Xu Jan. 2, 2012, 7:28 a.m. UTC | #1
On Sun, Jan 01, 2012 at 10:32:34PM -0500, David Ward wrote:
> For IPsec tunnel mode (or BEET mode), after inbound packets are xfrm'ed,
> call the IPv4/IPv6 receive handler directly instead of calling netif_rx.
> In addition to avoiding unneeded re-processing of the MAC layer, packets
> will not be received a second time on network taps. (Note that outbound
> packets are only received on network taps post-xfrm, but inbound packets
> were being received both pre- and post-xfrm. So now network taps will
> receive packets in either direction only once, in the form that they go
> "over the wire".)
> 
> Signed-off-by: David Ward <david.ward@ll.mit.edu>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>

You can't do this as this may cause stack overruns if we nest
too deeply (e.g. nested tunnels would re-enter the IP receive path
once per level of encapsulation, stacking frames each time).

Changing the existing tap processing behaviour will also break
existing setups.

Cheers,
Eric Dumazet Jan. 2, 2012, 8:18 a.m. UTC | #2
On Monday, January 2, 2012 at 18:28 +1100, Herbert Xu wrote:

> You can't do this as this may cause stack overruns if we nest
> too deeply.
> 

I was considering adding a generic helper for tunneling that takes
into account the nesting depth of the current packet.

[ calling netif_receive_skb() instead of netif_rx(), to solve the
out-of-order (OOO) delivery problem that occurs on SMP when interrupts
are spread across several CPUs ]


We could use the delta between skb->data and skb->head as an estimate
of this depth, so as to avoid adding a new skb field?

#define DEPTH_THRESHOLD (NET_SKB_PAD + 64)

static inline void netif_reinject(struct sk_buff *skb)
{
	/* Headroom consumed so far is a cheap proxy for nesting depth:
	 * every decapsulation pulls skb->data further from skb->head.
	 */
	if (skb->data - skb->head < DEPTH_THRESHOLD)
		netif_receive_skb(skb);	/* shallow: process synchronously */
	else
		netif_rx(skb);		/* deep: defer to the backlog queue */
}
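
For context, a minimal sketch of where such a helper could replace the
netif_rx() call this patch touches, at the decaps point in xfrm_input()
(illustrative placement only, not a tested change):

	/* Hypothetical use of the helper above: reinject synchronously
	 * while the headroom heuristic says we are shallow, fall back
	 * to the backlog queue once we are deep.
	 */
	if (decaps) {
		skb_dst_drop(skb);
		netif_reinject(skb);
		return 0;
	}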


David Ward Jan. 2, 2012, 7:52 p.m. UTC | #3
Hi Herbert,

On 01/02/2012 02:28 AM, Herbert Xu wrote:
> On Sun, Jan 01, 2012 at 10:32:34PM -0500, David Ward wrote:
>> For IPsec tunnel mode (or BEET mode), after inbound packets are xfrm'ed,
>> call the IPv4/IPv6 receive handler directly instead of calling netif_rx.
>> In addition to avoiding unneeded re-processing of the MAC layer, packets
>> will not be received a second time on network taps. (Note that outbound
>> packets are only received on network taps post-xfrm, but inbound packets
>> were being received both pre- and post-xfrm. So now network taps will
>> receive packets in either direction only once, in the form that they go
>> "over the wire".)
>>
>> Signed-off-by: David Ward <david.ward@ll.mit.edu>
>> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> You can't do this as this may cause stack overruns if we nest
> too deeply.
Sorry if I'm missing something, but how are such overruns avoided on the 
outbound side?

> Changing the existing tap processing behaviour will also break
> existing setups.
Assuming there might be a better way to make this change, are there 
examples of existing setups that would be negatively affected? From my 
perspective this behavior is just an unintended artifact of xfrm'ed 
packets being placed back into netif_rx, which only occurs for inbound 
packets. It also complicates the use of network taps on these 
interfaces: how do you systematically determine whether any given 
packet is post-xfrm and was already seen in an earlier form? It seems 
to me that network taps operate at a lower layer than xfrm, and so xfrm 
should be invisible to them. If users are, for example, capturing ESP 
packets from a PF_PACKET socket and want to examine the decrypted 
payload, I think the capture application should be responsible for the 
decryption, just as it would be at higher layers with something like 
SSL/TLS (for example, both protocols can be decrypted by Wireshark when 
it is given the keys).

I would appreciate your feedback.

David
David Miller Jan. 3, 2012, 5:56 p.m. UTC | #4
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 02 Jan 2012 09:18:02 +0100

> if (skb->data - skb->head < DEPTH_THRESHOLD)
> 	netif_receive_skb(skb);

Fundamentally I think such things are doomed to failure.

I encourage you to instead look into the idea proposed the other year
(which unfortunately I found no time to implement), wherein we have a
top-level looping structure.

The scheme was originally proposed for TX but we can do it just as easily
for RX too.  Essentially the entity that begins the traversal into the
packet send or receive path makes a mark in some per-cpu data structure.

When we return to the mark setting spot, we check if any "continued
processing" work got queued there, and run it if so, keeping the mark
set.  Once the queued work is rechecked and found to be all clear, we
clear the mark and finish.

This has performance benefits too because on both the TX and RX side
we'll stop this whole dance where we schedule a SW interrupt and incur
all the overhead necessary to do that.

It's going to be faster than your threshold test scheme because we'll
be using a smaller stack frame and thus get better cache hits there.
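
A rough sketch of that mark-and-loop scheme (all names here, such as
rx_context, rx_entry and do_receive, are hypothetical illustrations of
the proposal, not existing kernel code; per-CPU queue initialization is
elided, and softirq context is assumed so the lockless queue helpers
are safe):

struct rx_context {
	bool			active;		/* mark: traversal in progress */
	struct sk_buff_head	deferred;	/* work queued by nested entries */
};

static DEFINE_PER_CPU(struct rx_context, rx_ctx);

static void rx_entry(struct sk_buff *skb)
{
	struct rx_context *ctx = this_cpu_ptr(&rx_ctx);

	if (ctx->active) {
		/* Nested entry: queue the packet instead of recursing. */
		__skb_queue_tail(&ctx->deferred, skb);
		return;
	}

	ctx->active = true;
	do_receive(skb);	/* hypothetical: the real protocol input work */

	/* Drain work queued by nested reinjections, keeping the mark
	 * set so any further nesting also lands on the queue.
	 */
	while ((skb = __skb_dequeue(&ctx->deferred)) != NULL)
		do_receive(skb);

	ctx->active = false;
}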
Herbert Xu Jan. 11, 2012, 4:45 a.m. UTC | #5
On Mon, Jan 02, 2012 at 02:52:36PM -0500, Ward, David - 0663 - MITLL wrote:
>
> Sorry if I'm missing something, but how are such overruns avoided on the  
> outbound side?

We use tail calls on the output path.
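
To illustrate the pattern (encapsulate() and lower_layer_output() are
made-up names): each layer's last action is to return the next
handler's result, so with the compiler's sibling-call optimization
nested tunnels reuse one stack frame instead of stacking them.

static int tunnel_output(struct sk_buff *skb)
{
	encapsulate(skb);		/* hypothetical: push the outer header */
	return lower_layer_output(skb);	/* tail call: nothing runs after it */
}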

> Assuming there might be a better way to make this change, are there  
> examples of existing setups that would be negatively affected? From my  
> perspective this behavior is just an unintended artifact of xfrm'ed  
> packets being placed back into netif_rx, which only occurs for inbound  
> packets, and it complicates the usage of network taps on these  
> interfaces (i.e. how do you systematically determine whether any packet  
> is post-xfrm and was already seen in an earlier form?). It seems to me  
> that network taps operate at a lower layer than xfrm, and so xfrm should  
> be invisible to the network taps. If users are, for example, capturing  
> ESP packets from a PF_PACKET socket and want to examine the decrypted  
> payload, I think the capture application should be responsible for the  
> decryption, just as it would be at higher layers with something like  
> SSL/TLS (and again for example, both protocols can be decrypted by  
> Wireshark when provided the keys).

While I sympathise with your argument, doing it nearly 10 years
after this behaviour was implemented is just too dangerous IMHO.

Cheers,

Patch

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index b203e14..423a779 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -329,6 +329,7 @@  struct xfrm_state_afinfo {
 						 struct sk_buff *skb);
 	int			(*extract_output)(struct xfrm_state *x,
 						  struct sk_buff *skb);
+	int			(*tunnel_finish)(struct sk_buff *skb);
 	int			(*transport_finish)(struct sk_buff *skb,
 						    int async);
 };
@@ -1453,6 +1454,7 @@  extern int xfrm4_extract_header(struct sk_buff *skb);
 extern int xfrm4_extract_input(struct xfrm_state *x, struct sk_buff *skb);
 extern int xfrm4_rcv_encap(struct sk_buff *skb, int nexthdr, __be32 spi,
 			   int encap_type);
+extern int xfrm4_tunnel_finish(struct sk_buff *skb);
 extern int xfrm4_transport_finish(struct sk_buff *skb, int async);
 extern int xfrm4_rcv(struct sk_buff *skb);
 
@@ -1470,6 +1472,7 @@  extern int xfrm4_tunnel_deregister(struct xfrm_tunnel *handler, unsigned short f
 extern int xfrm6_extract_header(struct sk_buff *skb);
 extern int xfrm6_extract_input(struct xfrm_state *x, struct sk_buff *skb);
 extern int xfrm6_rcv_spi(struct sk_buff *skb, int nexthdr, __be32 spi);
+extern int xfrm6_tunnel_finish(struct sk_buff *skb);
 extern int xfrm6_transport_finish(struct sk_buff *skb, int async);
 extern int xfrm6_rcv(struct sk_buff *skb);
 extern int xfrm6_input_addr(struct sk_buff *skb, xfrm_address_t *daddr,
diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index 06814b6..4903a01 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -46,6 +46,11 @@  int xfrm4_rcv_encap(struct sk_buff *skb, int nexthdr, __be32 spi,
 }
 EXPORT_SYMBOL(xfrm4_rcv_encap);
 
+int xfrm4_tunnel_finish(struct sk_buff *skb)
+{
+	return ip_rcv(skb, skb->dev, NULL, skb->dev);
+}
+
 int xfrm4_transport_finish(struct sk_buff *skb, int async)
 {
 	struct iphdr *iph = ip_hdr(skb);
diff --git a/net/ipv4/xfrm4_state.c b/net/ipv4/xfrm4_state.c
index 9258e75..1931c42 100644
--- a/net/ipv4/xfrm4_state.c
+++ b/net/ipv4/xfrm4_state.c
@@ -82,6 +82,7 @@  static struct xfrm_state_afinfo xfrm4_state_afinfo = {
 	.output_finish		= xfrm4_output_finish,
 	.extract_input		= xfrm4_extract_input,
 	.extract_output		= xfrm4_extract_output,
+	.tunnel_finish		= xfrm4_tunnel_finish,
 	.transport_finish	= xfrm4_transport_finish,
 };
 
diff --git a/net/ipv6/xfrm6_input.c b/net/ipv6/xfrm6_input.c
index f8c3cf8..dc898a8 100644
--- a/net/ipv6/xfrm6_input.c
+++ b/net/ipv6/xfrm6_input.c
@@ -29,6 +29,11 @@  int xfrm6_rcv_spi(struct sk_buff *skb, int nexthdr, __be32 spi)
 }
 EXPORT_SYMBOL(xfrm6_rcv_spi);
 
+int xfrm6_tunnel_finish(struct sk_buff *skb)
+{
+	return ipv6_rcv(skb, skb->dev, NULL, skb->dev);
+}
+
 int xfrm6_transport_finish(struct sk_buff *skb, int async)
 {
 	skb_network_header(skb)[IP6CB(skb)->nhoff] =
diff --git a/net/ipv6/xfrm6_state.c b/net/ipv6/xfrm6_state.c
index f2d72b8..51d31c3 100644
--- a/net/ipv6/xfrm6_state.c
+++ b/net/ipv6/xfrm6_state.c
@@ -182,6 +182,7 @@  static struct xfrm_state_afinfo xfrm6_state_afinfo = {
 	.output_finish		= xfrm6_output_finish,
 	.extract_input		= xfrm6_extract_input,
 	.extract_output		= xfrm6_extract_output,
+	.tunnel_finish		= xfrm6_tunnel_finish,
 	.transport_finish	= xfrm6_transport_finish,
 };
 
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 54a0dc2..571af71 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -262,7 +262,9 @@  resume:
 
 	if (decaps) {
 		skb_dst_drop(skb);
-		netif_rx(skb);
+		skb_reset_network_header(skb);
+		skb_reset_transport_header(skb);
+		x->inner_mode->afinfo->tunnel_finish(skb);
 		return 0;
 	} else {
 		return x->inner_mode->afinfo->transport_finish(skb, async);