diff mbox

tun: orphan an skb on tx

Message ID 20100413145944.GA7716@redhat.com
State New
Headers show

Commit Message

Michael S. Tsirkin April 13, 2010, 2:59 p.m. UTC
The following situation was observed in the field:
tap1 sends packets, tap2 does not consume them, as a result
tap1 can not be closed. This happens because
tun/tap devices can hang on to skbs undefinitely.

As noted by Herbert, possible solutions include a timeout followed by a
copy/change of ownership of the skb, or always copying/changing
ownership if we're going into a hostile device.

This patch implements the second approach.

Note: one issue still remaining is that since skbs
keep reference to tun socket and tun socket has a
reference to tun device, we won't flush backlog,
instead simply waiting for all skbs to get transmitted.
At least this is not user-triggerable, and
this was not reported in practice, my assumption is
other devices besides tap complete an skb
within finite time after it has been queued.

A possible solution for the second issue
would not to have socket reference the device,
instead, implement dev->destructor for tun, and
wait for all skbs to complete there, but this
needs some thought, probably too risky for 2.6.34.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>

---

Please review the below, and consider for 2.6.34,
and stable trees.

 drivers/net/tun.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

Comments

Herbert Xu April 13, 2010, 3:12 p.m. UTC | #1
On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote:
> The following situation was observed in the field:
> tap1 sends packets, tap2 does not consume them, as a result
> tap1 can not be closed. This happens because
> tun/tap devices can hang on to skbs undefinitely.
> 
> As noted by Herbert, possible solutions include a timeout followed by a
> copy/change of ownership of the skb, or always copying/changing
> ownership if we're going into a hostile device.
> 
> This patch implements the second approach.
> 
> Note: one issue still remaining is that since skbs
> keep reference to tun socket and tun socket has a
> reference to tun device, we won't flush backlog,
> instead simply waiting for all skbs to get transmitted.
> At least this is not user-triggerable, and
> this was not reported in practice, my assumption is
> other devices besides tap complete an skb
> within finite time after it has been queued.
> 
> A possible solution for the second issue
> would not to have socket reference the device,
> instead, implement dev->destructor for tun, and
> wait for all skbs to complete there, but this
> needs some thought, probably too risky for 2.6.34.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Thanks,
Jan Kiszka April 13, 2010, 3:36 p.m. UTC | #2
Michael S. Tsirkin wrote:
> The following situation was observed in the field:
> tap1 sends packets, tap2 does not consume them, as a result
> tap1 can not be closed.

And before that, tap1 may not be able to send further packets to anyone
else on the bridge as its TX resources were blocked by tap2 - that's
what we saw in the field.

> This happens because
> tun/tap devices can hang on to skbs undefinitely.
> 
> As noted by Herbert, possible solutions include a timeout followed by a
> copy/change of ownership of the skb, or always copying/changing
> ownership if we're going into a hostile device.
> 
> This patch implements the second approach.
> 
> Note: one issue still remaining is that since skbs
> keep reference to tun socket and tun socket has a
> reference to tun device, we won't flush backlog,
> instead simply waiting for all skbs to get transmitted.
> At least this is not user-triggerable, and
> this was not reported in practice, my assumption is
> other devices besides tap complete an skb
> within finite time after it has been queued.
> 
> A possible solution for the second issue
> would not to have socket reference the device,
> instead, implement dev->destructor for tun, and
> wait for all skbs to complete there, but this
> needs some thought, probably too risky for 2.6.34.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>
> 
> ---
> 
> Please review the below, and consider for 2.6.34,
> and stable trees.
> 
>  drivers/net/tun.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 96c39bd..4326520 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  		}
>  	}
>  
> +	/* Orphan the skb - required as we might hang on to it
> +	 * for indefinite time. */
> +	skb_orphan(skb);
> +
>  	/* Enqueue packet */
>  	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
>  	dev->trans_start = jiffies;

Nice solution, thanks!

Jan
Eric Dumazet April 13, 2010, 4:40 p.m. UTC | #3
Le mardi 13 avril 2010 à 17:36 +0200, Jan Kiszka a écrit :
> Michael S. Tsirkin wrote:
> > The following situation was observed in the field:
> > tap1 sends packets, tap2 does not consume them, as a result
> > tap1 can not be closed.
> 
> And before that, tap1 may not be able to send further packets to anyone
> else on the bridge as its TX resources were blocked by tap2 - that's
> what we saw in the field.
> 

After the patch, tap1 is able to flood tap2, and tap3/tap4 not able to
send one single frame. Is it OK ?

Back to the problem : tap1 cannot be closed.

Why ? because of refcounts ?

When a socket with inflight tx packets is closed, we dont block the
close, we only delay the socket freeing once all packets were delivered
and freed.
Jan Kiszka April 13, 2010, 4:52 p.m. UTC | #4
Eric Dumazet wrote:
> Le mardi 13 avril 2010 à 17:36 +0200, Jan Kiszka a écrit :
>> Michael S. Tsirkin wrote:
>>> The following situation was observed in the field:
>>> tap1 sends packets, tap2 does not consume them, as a result
>>> tap1 can not be closed.
>> And before that, tap1 may not be able to send further packets to anyone
>> else on the bridge as its TX resources were blocked by tap2 - that's
>> what we saw in the field.
>>
> 
> After the patch, tap1 is able to flood tap2, and tap3/tap4 not able to
> send one single frame. Is it OK ?

I think if that's a real issue, you have to apply traffic shaping to the
untrusted nodes. The existing flow-control scheme was fragile anyway as
you had to translate packet lengths on TX side to packet counts on RX.

Jan
Michael S. Tsirkin April 13, 2010, 5:39 p.m. UTC | #5
On Tue, Apr 13, 2010 at 06:40:38PM +0200, Eric Dumazet wrote:
> Le mardi 13 avril 2010 à 17:36 +0200, Jan Kiszka a écrit :
> > Michael S. Tsirkin wrote:
> > > The following situation was observed in the field:
> > > tap1 sends packets, tap2 does not consume them, as a result
> > > tap1 can not be closed.
> > 
> > And before that, tap1 may not be able to send further packets to anyone
> > else on the bridge as its TX resources were blocked by tap2 - that's
> > what we saw in the field.
> > 
> 
> After the patch, tap1 is able to flood tap2, and tap3/tap4 not able to
> send one single frame. Is it OK ?

Yes :) This was always possible. Number of senders needed to flood
a receiver might vary depending on send/recv queue size
that you set. External sources can also fill your RX queue
if you let them. In the end, we need to rely on the scheduler for fairness,
or apply packet shaping.

> Back to the problem : tap1 cannot be closed.
> 
> Why ? because of refcounts ?

Yes.

> When a socket with inflight tx packets is closed, we dont block the
> close, we only delay the socket freeing once all packets were delivered
> and freed.
> 

Which is wrong, since this is under userspace control, so you get
unkillable processes.
Eric Dumazet April 13, 2010, 6:31 p.m. UTC | #6
Le mardi 13 avril 2010 à 20:39 +0300, Michael S. Tsirkin a écrit :

> > When a socket with inflight tx packets is closed, we dont block the
> > close, we only delay the socket freeing once all packets were delivered
> > and freed.
> > 
> 
> Which is wrong, since this is under userspace control, so you get
> unkillable processes.
> 

We do not get unkillable processes, at least with sockets I was thinking
about (TCP/UDP ones).

Maybe tun sockets can behave the same ?

Herbert Acked your patch, so I guess its OK, but I think it can be
dangerous.

Anyway my feeling is that we try to add various mechanisms to keep a
hostile user flooding another one.

For example, UDP got memory accounting quite recently, and we added
socket backlog limits very recently. It was considered not needed few
years ago.
Michael S. Tsirkin April 13, 2010, 8:25 p.m. UTC | #7
On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote:
> Le mardi 13 avril 2010 à 20:39 +0300, Michael S. Tsirkin a écrit :
> 
> > > When a socket with inflight tx packets is closed, we dont block the
> > > close, we only delay the socket freeing once all packets were delivered
> > > and freed.
> > > 
> > 
> > Which is wrong, since this is under userspace control, so you get
> > unkillable processes.
> > 
> 
> We do not get unkillable processes, at least with sockets I was thinking
> about (TCP/UDP ones).
> 
> Maybe tun sockets can behave the same ?

Looks like that's what my patch does: ip_rcv seems to call
skb_orphan too.

> Herbert Acked your patch, so I guess its OK, but I think it can be
> dangerous.
> Anyway my feeling is that we try to add various mechanisms to keep a
> hostile user flooding another one.
> 
> For example, UDP got memory accounting quite recently, and we added
> socket backlog limits very recently. It was considered not needed few
> years ago.
>
Eric Dumazet April 13, 2010, 8:38 p.m. UTC | #8
Le mardi 13 avril 2010 à 23:25 +0300, Michael S. Tsirkin a écrit :
> On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote:
> > Le mardi 13 avril 2010 à 20:39 +0300, Michael S. Tsirkin a écrit :
> > 
> > > > When a socket with inflight tx packets is closed, we dont block the
> > > > close, we only delay the socket freeing once all packets were delivered
> > > > and freed.
> > > > 
> > > 
> > > Which is wrong, since this is under userspace control, so you get
> > > unkillable processes.
> > > 
> > 
> > We do not get unkillable processes, at least with sockets I was thinking
> > about (TCP/UDP ones).
> > 
> > Maybe tun sockets can behave the same ?
> 
> Looks like that's what my patch does: ip_rcv seems to call
> skb_orphan too.

Well, I was speaking of tx side, you speak of receiving side.
An external flood (coming from another domain) is another problem.

A sender might flood the 'network' inside our domain. How can we
reasonably limit the sender ?

Maybe the answer is 'We can not', but it should be stated somewhere, so
that someone can address this point later.
Michael S. Tsirkin April 13, 2010, 8:43 p.m. UTC | #9
On Tue, Apr 13, 2010 at 10:38:06PM +0200, Eric Dumazet wrote:
> Le mardi 13 avril 2010 à 23:25 +0300, Michael S. Tsirkin a écrit :
> > On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote:
> > > Le mardi 13 avril 2010 à 20:39 +0300, Michael S. Tsirkin a écrit :
> > > 
> > > > > When a socket with inflight tx packets is closed, we dont block the
> > > > > close, we only delay the socket freeing once all packets were delivered
> > > > > and freed.
> > > > > 
> > > > 
> > > > Which is wrong, since this is under userspace control, so you get
> > > > unkillable processes.
> > > > 
> > > 
> > > We do not get unkillable processes, at least with sockets I was thinking
> > > about (TCP/UDP ones).
> > > 
> > > Maybe tun sockets can behave the same ?
> > 
> > Looks like that's what my patch does: ip_rcv seems to call
> > skb_orphan too.
> 
> Well, I was speaking of tx side, you speak of receiving side.

Point is, both ip_rcv and my patch call skb_orphan.

> An external flood (coming from another domain) is another problem.
> 
> A sender might flood the 'network' inside our domain. How can we
> reasonably limit the sender ?
> 
> Maybe the answer is 'We can not', but it should be stated somewhere, so
> that someone can address this point later.
> 

And whatever's done should ideally work for tap to IP
and IP to IP sockets as well, not just tap to tap.
Herbert Xu April 14, 2010, 12:58 a.m. UTC | #10
On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote:
>
> Herbert Acked your patch, so I guess its OK, but I think it can be
> dangerous.

The tun socket accounting was never designed to stop it from
flooding another tun interface.  It's there to stop it from
transmitting above a destination interface TX bandwidth and
cause unnecessary packet drops.  It also limits the total amount
of kernel memory that can be pinned down by a single tun interface.

In this case, all we're doing is shifting the accounting from the
"hardware" queue to the qdisc queue.

So your ability to flood a tun interface is essentially unchanged.

BTW we do the same thing in a number of hardware drivers, as well
as virtio-net.

Cheers,
David Miller April 14, 2010, 11:55 a.m. UTC | #11
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 14 Apr 2010 08:58:22 +0800

> On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote:
>>
>> Herbert Acked your patch, so I guess its OK, but I think it can be
>> dangerous.
> 
> The tun socket accounting was never designed to stop it from
> flooding another tun interface.  It's there to stop it from
> transmitting above a destination interface TX bandwidth and
> cause unnecessary packet drops.  It also limits the total amount
> of kernel memory that can be pinned down by a single tun interface.
> 
> In this case, all we're doing is shifting the accounting from the
> "hardware" queue to the qdisc queue.
> 
> So your ability to flood a tun interface is essentially unchanged.
> 
> BTW we do the same thing in a number of hardware drivers, as well
> as virtio-net.

Right.  Although this reminds me about the whole SKB
orphaning on xmit issue that keeps coming back to haunt
us.

If there weren't odd references to the SKB's socket in
the packet scheduler et al. we could just orphan these
things right upon entry to the qdisc and not have to
add hacks like this to every driver.

In fact... maybe we can just do it in dev_hard_queue_xmit()
since we are out of the qdisc at that point.... but I guess
there might be weird drivers that want the SKB socket in
their ->xmit routine...  Ho hum.

In any event that's net-next-2.6 exploratory material, and I've
applied this patch to net-2.6, thanks!
Michael S. Tsirkin April 21, 2010, 11:35 a.m. UTC | #12
On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote:
> The following situation was observed in the field:
> tap1 sends packets, tap2 does not consume them, as a result
> tap1 can not be closed. This happens because
> tun/tap devices can hang on to skbs undefinitely.
> 
> As noted by Herbert, possible solutions include a timeout followed by a
> copy/change of ownership of the skb, or always copying/changing
> ownership if we're going into a hostile device.
> 
> This patch implements the second approach.
> 
> Note: one issue still remaining is that since skbs
> keep reference to tun socket and tun socket has a
> reference to tun device, we won't flush backlog,
> instead simply waiting for all skbs to get transmitted.
> At least this is not user-triggerable, and
> this was not reported in practice, my assumption is
> other devices besides tap complete an skb
> within finite time after it has been queued.
> 
> A possible solution for the second issue
> would not to have socket reference the device,
> instead, implement dev->destructor for tun, and
> wait for all skbs to complete there, but this
> needs some thought, probably too risky for 2.6.34.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>
> 
> ---
> 
> Please review the below, and consider for 2.6.34,
> and stable trees.
> 
>  drivers/net/tun.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 96c39bd..4326520 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  		}
>  	}
>  
> +	/* Orphan the skb - required as we might hang on to it
> +	 * for indefinite time. */
> +	skb_orphan(skb);
> +
>  	/* Enqueue packet */
>  	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
>  	dev->trans_start = jiffies;
> -- 
> 1.7.0.2.280.gc6f05

This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6
Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33).

Thanks!
Michael S. Tsirkin April 21, 2010, 11:45 a.m. UTC | #13
On Wed, Apr 21, 2010 at 01:46:33PM +0200, Jan Kiszka wrote:
> Michael S. Tsirkin wrote:
> > On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote:
> >> The following situation was observed in the field:
> >> tap1 sends packets, tap2 does not consume them, as a result
> >> tap1 can not be closed. This happens because
> >> tun/tap devices can hang on to skbs undefinitely.
> >>
> >> As noted by Herbert, possible solutions include a timeout followed by a
> >> copy/change of ownership of the skb, or always copying/changing
> >> ownership if we're going into a hostile device.
> >>
> >> This patch implements the second approach.
> >>
> >> Note: one issue still remaining is that since skbs
> >> keep reference to tun socket and tun socket has a
> >> reference to tun device, we won't flush backlog,
> >> instead simply waiting for all skbs to get transmitted.
> >> At least this is not user-triggerable, and
> >> this was not reported in practice, my assumption is
> >> other devices besides tap complete an skb
> >> within finite time after it has been queued.
> >>
> >> A possible solution for the second issue
> >> would not to have socket reference the device,
> >> instead, implement dev->destructor for tun, and
> >> wait for all skbs to complete there, but this
> >> needs some thought, probably too risky for 2.6.34.
> >>
> >> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >> Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>
> >>
> >> ---
> >>
> >> Please review the below, and consider for 2.6.34,
> >> and stable trees.
> >>
> >>  drivers/net/tun.c |    4 ++++
> >>  1 files changed, 4 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >> index 96c39bd..4326520 100644
> >> --- a/drivers/net/tun.c
> >> +++ b/drivers/net/tun.c
> >> @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>  		}
> >>  	}
> >>  
> >> +	/* Orphan the skb - required as we might hang on to it
> >> +	 * for indefinite time. */
> >> +	skb_orphan(skb);
> >> +
> >>  	/* Enqueue packet */
> >>  	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
> >>  	dev->trans_start = jiffies;
> >> -- 
> >> 1.7.0.2.280.gc6f05
> > 
> > This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6
> > Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33).
> > 
> > Thanks!
> > 
> 
> Before I forget again:
> 
> Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
> (namely the field test of our customer)
> 
> Jan

Cool, thanks!

> -- 
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux
Jan Kiszka April 21, 2010, 11:46 a.m. UTC | #14
Michael S. Tsirkin wrote:
> On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote:
>> The following situation was observed in the field:
>> tap1 sends packets, tap2 does not consume them, as a result
>> tap1 can not be closed. This happens because
>> tun/tap devices can hang on to skbs undefinitely.
>>
>> As noted by Herbert, possible solutions include a timeout followed by a
>> copy/change of ownership of the skb, or always copying/changing
>> ownership if we're going into a hostile device.
>>
>> This patch implements the second approach.
>>
>> Note: one issue still remaining is that since skbs
>> keep reference to tun socket and tun socket has a
>> reference to tun device, we won't flush backlog,
>> instead simply waiting for all skbs to get transmitted.
>> At least this is not user-triggerable, and
>> this was not reported in practice, my assumption is
>> other devices besides tap complete an skb
>> within finite time after it has been queued.
>>
>> A possible solution for the second issue
>> would not to have socket reference the device,
>> instead, implement dev->destructor for tun, and
>> wait for all skbs to complete there, but this
>> needs some thought, probably too risky for 2.6.34.
>>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>
>>
>> ---
>>
>> Please review the below, and consider for 2.6.34,
>> and stable trees.
>>
>>  drivers/net/tun.c |    4 ++++
>>  1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 96c39bd..4326520 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>  		}
>>  	}
>>  
>> +	/* Orphan the skb - required as we might hang on to it
>> +	 * for indefinite time. */
>> +	skb_orphan(skb);
>> +
>>  	/* Enqueue packet */
>>  	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
>>  	dev->trans_start = jiffies;
>> -- 
>> 1.7.0.2.280.gc6f05
> 
> This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6
> Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33).
> 
> Thanks!
> 

Before I forget again:

Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
(namely the field test of our customer)

Jan
Greg KH April 21, 2010, 7:16 p.m. UTC | #15
On Wed, Apr 21, 2010 at 02:35:57PM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote:
> > The following situation was observed in the field:
> > tap1 sends packets, tap2 does not consume them, as a result
> > tap1 can not be closed. This happens because
> > tun/tap devices can hang on to skbs undefinitely.
> > 
> > As noted by Herbert, possible solutions include a timeout followed by a
> > copy/change of ownership of the skb, or always copying/changing
> > ownership if we're going into a hostile device.
> > 
> > This patch implements the second approach.
> > 
> > Note: one issue still remaining is that since skbs
> > keep reference to tun socket and tun socket has a
> > reference to tun device, we won't flush backlog,
> > instead simply waiting for all skbs to get transmitted.
> > At least this is not user-triggerable, and
> > this was not reported in practice, my assumption is
> > other devices besides tap complete an skb
> > within finite time after it has been queued.
> > 
> > A possible solution for the second issue
> > would not to have socket reference the device,
> > instead, implement dev->destructor for tun, and
> > wait for all skbs to complete there, but this
> > needs some thought, probably too risky for 2.6.34.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>
> > 
> > ---
> > 
> > Please review the below, and consider for 2.6.34,
> > and stable trees.
> > 
> >  drivers/net/tun.c |    4 ++++
> >  1 files changed, 4 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index 96c39bd..4326520 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >  		}
> >  	}
> >  
> > +	/* Orphan the skb - required as we might hang on to it
> > +	 * for indefinite time. */
> > +	skb_orphan(skb);
> > +
> >  	/* Enqueue packet */
> >  	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
> >  	dev->trans_start = jiffies;
> > -- 
> > 1.7.0.2.280.gc6f05
> 
> This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6
> Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33).

David Miller queues up the patches for the network subsystem for the
stable trees, and then forwards them to me when he feels they are ready.

So I'll defer to him on this one and wait for it to come from him.

thanks,

greg k-h
David Woodhouse Feb. 1, 2015, 11:20 a.m. UTC | #16
On Wed, 2010-04-14 at 08:58 +0800, Herbert Xu wrote:
> On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote:
> >
> > Herbert Acked your patch, so I guess its OK, but I think it can be
> > dangerous.
> 
> The tun socket accounting was never designed to stop it from
> flooding another tun interface.  It's there to stop it from
> transmitting above a destination interface TX bandwidth and
> cause unnecessary packet drops.  It also limits the total amount
> of kernel memory that can be pinned down by a single tun interface.
> 
> In this case, all we're doing is shifting the accounting from the
> "hardware" queue to the qdisc queue.
> 
> So your ability to flood a tun interface is essentially unchanged.

I've just been looking at VPN performance, using netperf to flood an
openconnect/ocserv connection over GigE and profiling my VPN client.

If I run netperf over the *unencrypted* link, it only sends 1Gb/s of
packets — because the packets are correctly accounted to netperf's UDP
socket until the moment they're actually transmitted on the wire, and
the backpressure works correctly.

When I run over the VPN, netperf thinks it sent 2½ times the amount of
TX traffic. Packets are being dropped by the tun device before even
feeding them up to the VPN client to be sent — presumably because of
this skb_orphan() call. (The client itself should do the right thing,
and only suck packets out of the tun at the rate it can shove them out
*its* UDP socket.)

Did we ever look at the alternative solution of taking ownership only
after a timeout, or on demand when we need to shut down the device?
Michael S. Tsirkin Feb. 1, 2015, 12:26 p.m. UTC | #17
On Sun, Feb 01, 2015 at 11:20:33AM +0000, David Woodhouse wrote:
> On Wed, 2010-04-14 at 08:58 +0800, Herbert Xu wrote:
> > On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote:
> > >
> > > Herbert Acked your patch, so I guess its OK, but I think it can be
> > > dangerous.
> > 
> > The tun socket accounting was never designed to stop it from
> > flooding another tun interface.  It's there to stop it from
> > transmitting above a destination interface TX bandwidth and
> > cause unnecessary packet drops.  It also limits the total amount
> > of kernel memory that can be pinned down by a single tun interface.
> > 
> > In this case, all we're doing is shifting the accounting from the
> > "hardware" queue to the qdisc queue.
> > 
> > So your ability to flood a tun interface is essentially unchanged.
> 
> I've just been looking at VPN performance, using netperf to flood an
> openconnect/ocserv connection over GigE and profiling my VPN client.
> 
> If I run netperf over the *unencrypted* link, it only sends 1Gb/s of
> packets — because the packets are correctly accounted to netperf's UDP
> socket until the moment they're actually transmitted on the wire, and
> the backpressure works correctly.
> 
> When I run over the VPN, netperf thinks it sent 2½ times the amount of
> TX traffic.

At some level, it's expected: netperf's manual actually says:
	A UDP_STREAM test has no end-to-end flow control - UDP provides none and
	neither does netperf. However, if you wish, you can configure netperf
	with --enable-intervals=yes to enable the global command-line -b and -w
	options to pace bursts of traffic onto the network.

> Packets are being dropped by the tun device before even
> feeding them up to the VPN client to be sent — presumably because of
> this skb_orphan() call. (The client itself should do the right thing,
> and only suck packets out of the tun at the rate it can shove them out
> *its* UDP socket.)

A simple work-around is to limit the rate using a non work conservig qdisc.

> Did we ever look at the alternative solution of taking ownership only
> after a timeout, or on demand when we need to shut down the device?

I've been thinking about this on and off, but didn't find a good
safe solution yet.

For timeout, the difficulty is to find a good timer value,
low enough to avoid DOS attacks but high enough to avoid
spurious packet drops (and expensive timer interrupts).
David Woodhouse Feb. 1, 2015, 1:33 p.m. UTC | #18
On Sun, 2015-02-01 at 14:26 +0200, Michael S. Tsirkin wrote:
> > When I run over the VPN, netperf thinks it sent 2½ times the amount of
> > TX traffic.
> 
> At some level, it's expected: netperf's manual actually says:
> 	A UDP_STREAM test has no end-to-end flow control - UDP provides none and
> 	neither does netperf. However, if you wish, you can configure netperf
> 	with --enable-intervals=yes to enable the global command-line -b and -w
> 	options to pace bursts of traffic onto the network.

True, but UDP is just a canary for other protocols here. We *should* be
able to keep track of the packets before they even leave the box, and
know that we haven't even managed to send them. Even if we know it's a
datagram protocol and it's potentially going to be dropped in transit
later.

Of course, now I'm looking closely at the path these packets take to
leave the box, it starts to offend me that they're being passed up to
userspace just to encrypt them (as DTLS or ESP) and then send them back
down to the kernel on a UDP socket. The kernel already knows how to
{en,de}crypt ESP, and do the sequence number checking on incoming
packets.

I'm wondering if we bypass userspace in that case somehow — let
userspace negotiate the encryption and connect the UDP socket, then just
pass the socket fd and the parameters to the kernel so that incoming
packets are decrypted and 'received' on the tun device, and outgoing
packets on the tun device are encrypted and sent out on the UDP socket. 

The performance isn't too much of an issue for a VPN *client* in
practice, but we have a server implementation too which would probably
benefit quite well from such an offload facility.

If I were to look at such a thing, would it provoke screams of horror?
David Miller Feb. 1, 2015, 8:19 p.m. UTC | #19
From: David Woodhouse <dwmw2@infradead.org>
Date: Sun, 01 Feb 2015 13:33:50 +0000

> Of course, now I'm looking closely at the path these packets take to
> leave the box, it starts to offend me that they're being passed up to
> userspace just to encrypt them (as DTLS or ESP) and then send them back
> down to the kernel on a UDP socket. The kernel already knows how to
> {en,de}crypt ESP, and do the sequence number checking on incoming
> packets.

It's funny, I thought we had an IPSEC stack....
David Woodhouse Feb. 1, 2015, 9:29 p.m. UTC | #20
On Sun, 2015-02-01 at 12:19 -0800, David Miller wrote:
> From: David Woodhouse <dwmw2@infradead.org>
> Date: Sun, 01 Feb 2015 13:33:50 +0000
> 
> > Of course, now I'm looking closely at the path these packets take to
> > leave the box, it starts to offend me that they're being passed up to
> > userspace just to encrypt them (as DTLS or ESP) and then send them back
> > down to the kernel on a UDP socket. The kernel already knows how to
> > {en,de}crypt ESP, and do the sequence number checking on incoming
> > packets.
> 
> It's funny, I thought we had an IPSEC stack....

Right. But I'm trying to work out how we can sanely *use* that from a
VPN client.

The client normally sets up a tun device, configuring it with
appropriate IP addresses and routes by invoking vpnc-script or passing
the information back to NetworkManager. The client itself might not even
have root privs, in the NetworkManager case.

The initial authentication and connection are done over HTTPS, and
packets *can* be passed that way if they need to be. But obviously the
client *also* tries to set up a UDP data transport too — which is DTLS
in the case of Cisco AnyConnect, and ESP in UDP for Juniper.

If it *can* get communication over UDP, it'll use it. Otherwise it just
passes packets over the TCP connection. So it needs to dynamically set
up and tear down the ESP/DTLS tunnels as and when they are working.

Ideally we want it such that that packets routed to the tun device get
transparently encrypted and sent out on the UDP socket, and packets
received from UDP and successfully decrypted will appear to have arrived
on the tun device. The user may be manually tweaking the routing, or
setting up firewall/NAT/etc. on the tun device.

I can see how to set up an ESP in UDP tunnel such that it looks like the
packets are actually departing on the *physical* interface (which in
practice I suppose they are). But that's going to be fairly complex to
set up, and extremely non-intuitive and hard to manage for the user. To
the extent that I don't think it's actually deployable.

I really was looking for some way to push down something like an XFRM
state into the tun device and just say "shove them out here until I tell
you otherwise".
David Miller Feb. 2, 2015, 5:07 a.m. UTC | #21
From: David Woodhouse <dwmw2@infradead.org>
Date: Sun, 01 Feb 2015 21:29:43 +0000

> I really was looking for some way to push down something like an XFRM
> state into the tun device and just say "shove them out here until I tell
> you otherwise".

People decided to use TUN and push VPN stuff back into userspace,
and there are repercussions for that decision.

I'm not saying this to be mean or whatever, but I was very
disappointed when userland IPSEC solutions using TUN started showing
up.

We might as well have not have implemented the IPSEC stack at all,
because as a result of the userland VPN stuff our IPSEC stack is
largely unused except by a very narrow group of users.
David Woodhouse Feb. 2, 2015, 7:27 a.m. UTC | #22
On Sun, 2015-02-01 at 21:07 -0800, David Miller wrote:
> From: David Woodhouse <dwmw2@infradead.org>
> Date: Sun, 01 Feb 2015 21:29:43 +0000
> 
> > I really was looking for some way to push down something like an XFRM
> > state into the tun device and just say "shove them out here until I tell
> > you otherwise".
> 
> People decided to use TUN and push VPN stuff back into userspace,
> and there are repercussions for that decision.
> 
> I'm not saying this to be mean or whatever, but I was very
> disappointed when userland IPSEC solutions using TUN started showing
> up.

Yeah. That's a valid criticism of vpnc, certainly. I never did
understand why it reimplemented the IPSec stack.

For my OpenConnect client it's somewhat more justified though — the
initial data transport there is over TLS, which the kernel doesn't
support. And if we *can* establish UDP communication, that's over DTLS
which the kernel doesn't support either. It's not even the *standard*
version of DTLS because Cisco are still using a pre-RFC4347 version of
the protocol. And we also need to probe the UDP connectivity and do
keepalives and manage the fallback to using the TCP data transport.

It's not like vpnc where it really is just a case of setting up the ESP
context and letting it run.

It's only now I've added Juniper support, which uses ESP-in-UDP for the
data transport, that I'm doing something that the kernel supports at
all. And now I'm looking at how to make use of that.

> We might as well have not have implemented the IPSEC stack at all,
> because as a result of the userland VPN stuff our IPSEC stack is
> largely unused except by a very narrow group of users.

Well, I'd love to make better use of it if I can. I do suspect it makes
most sense for userspace to continue to manage the probing of UDP
connectivity, and the fallback to TCP mode — and I suspect it also makes
sense to continue to use tun for passing packets up to the VPN client
when it's using the TCP transport.

So the question would be how we handle redirecting the packet flow to
the optional UDP transport, when the VPN client determines that it's
available. For the sake of the user setting up firewall and routing
rules, I do think it's important that it continues to appear to
userspace as the *same* device for the entire lifetime of the session,
regardless of which transport the packets happen to be using at a given
moment in time. It doesn't *have* to be tun, though. 

You don't seem to like my suggestion of somehow pushing down an XFRM
state to the tun device to direct the packets out there instead of up to
userspace. Do you have an alternative suggestion... or a specific
concern that would help me come up with something you like better?

I'm guessing you don't want to push the *whole* management of the TLS
control connection *and* the UDP transport, and probing the latter with
keepalives, into the kernel? I certainly don't :)
Steffen Klassert Feb. 2, 2015, 8:24 a.m. UTC | #23
On Mon, Feb 02, 2015 at 07:27:10AM +0000, David Woodhouse wrote:
> On Sun, 2015-02-01 at 21:07 -0800, David Miller wrote:
> 
> > We might as well have not have implemented the IPSEC stack at all,
> > because as a result of the userland VPN stuff our IPSEC stack is
> > largely unused except by a very narrow group of users.
> 
> Well, I'd love to make better use of it if I can. I do suspect it makes
> most sense for userspace to continue to manage the probing of UDP
> connectivity, and the fallback to TCP mode — and I suspect it also makes
> sense to continue to use tun for passing packets up to the VPN client
> when it's using the TCP transport.
> 
> So the question would be how we handle redirecting the packet flow to
> the optional UDP transport, when the VPN client determines that it's
> available. For the sake of the user setting up firewall and routing
> rules, I do think it's important that it continues to appear to
> userspace as the *same* device for the entire lifetime of the session,
> regardless of which transport the packets happen to be using at a given
> moment in time. It doesn't *have* to be tun, though. 
> 
> You don't seem to like my suggestion of somehow pushing down an XFRM
> state to the tun device to direct the packets out there instead of up to
> userspace. Do you have an alternative suggestion... or a specific
> concern that would help me come up with something you like better?

Maybe you want to use a virtual tunnel interface (vti) what we have
already. Everything that is routed through such an interface is
guaranteed to be either encrypted if a matching xfrm state is present
or dropped. Same on the rceive side, everything that is received by
this interface is guaranteed to be IPsec processed. So you can do
a routing based decision about the IPsec processing.

While I'm sure it could handle the ESP in UDP encapsulation, I'm not that
sure about your TCP fallback because this requires a valid xfrm state
to allow packets to pass. Using the same interface for both is probably
not possible.
Phil Sutter Feb. 2, 2015, 3:23 p.m. UTC | #24
Hi,

On Mon, Feb 02, 2015 at 07:27:10AM +0000, David Woodhouse wrote:
> On Sun, 2015-02-01 at 21:07 -0800, David Miller wrote:
> > From: David Woodhouse <dwmw2@infradead.org>
> > Date: Sun, 01 Feb 2015 21:29:43 +0000
> > 
> > > I really was looking for some way to push down something like an XFRM
> > > state into the tun device and just say "shove them out here until I tell
> > > you otherwise".
> > 
> > People decided to use TUN and push VPN stuff back into userspace,
> > and there are repercussions for that decision.
> > 
> > I'm not saying this to be mean or whatever, but I was very
> > disappointed when userland IPSEC solutions using TUN started showing
> > up.
> 
> Yeah. That's a valid criticism of vpnc, certainly. I never did
> understand why it reimplemented the IPSec stack.
> 
> For my OpenConnect client it's somewhat more justified though — the
> initial data transport there is over TLS, which the kernel doesn't
> support. And if we *can* establish UDP communication, that's over DTLS
> which the kernel doesn't support either. It's not even the *standard*
> version of DTLS because Cisco are still using a pre-RFC4347 version of
> the protocol. And we also need to probe the UDP connectivity and do
> keepalives and manage the fallback to using the TCP data transport.
> 
> It's not like vpnc where it really is just a case of setting up the ESP
> context and letting it run.
> 
> It's only now I've added Juniper support, which uses ESP-in-UDP for the
> data transport, that I'm doing something that the kernel supports at
> all. And now I'm looking at how to make use of that.
> 
> > We might as well have not have implemented the IPSEC stack at all,
> > because as a result of the userland VPN stuff our IPSEC stack is
> > largely unused except by a very narrow group of users.
> 
> Well, I'd love to make better use of it if I can. I do suspect it makes
> most sense for userspace to continue to manage the probing of UDP
> connectivity, and the fallback to TCP mode — and I suspect it also makes
> sense to continue to use tun for passing packets up to the VPN client
> when it's using the TCP transport.
> 
> So the question would be how we handle redirecting the packet flow to
> the optional UDP transport, when the VPN client determines that it's
> available. For the sake of the user setting up firewall and routing
> rules, I do think it's important that it continues to appear to
> userspace as the *same* device for the entire lifetime of the session,
> regardless of which transport the packets happen to be using at a given
> moment in time. It doesn't *have* to be tun, though. 

Since you want to provide connectivity over HTTPS which is not possible
in kernel space, you are stuck with keeping the tun device. So the
packet flow in that case is identical to how e.g. OpenVPN does it:

- tunX holds default route
- OpenConnect then:
  - receives packets on /dev/tun
  - holds TCP socket to VPN concentrator
  - does encapsulation into TLS

Speaking of optimisation, the interesting part is the alternative flow
via IPsec in UDP. AFAICT, it should be possible to setup an ESP in UDP
tunnel using XFRM (see ip-xfrm(8) for reference), although I didn't try
that myself. The funny thing with XFRM is, it applies before the routing
decision does: If my IPsec policy matches, the packet goes that way no
matter what the routing table says about the original destination. This
can be used to override the default route provided via tun0 in the above
case.

Of course, OpenConnect has to manage all the XFRM/policy stuff on it's
own, since switching from ESP in UDP back to TLS would mean to tear down
the XFRM tunnel. OpenConnect would have to setup (a limited) XFRM and
send test traffic to decide whether to set it up fully (if limited) or
tear it down (if unlimited) again so traffic arrives at tunX again.

In my opinion, this might work. The whole setup is probably about as
intuitive as the fact that kernel IPsec tunnel mode does not naturally
provide an own interface. Firewall setup on top of that might become a
matter of try-and-error. Maybe having a VTI interface and merely moving
the default route instead of fiddling with policies all the time might
make things a little easier to comprehend, but surely adds some
performance overhead.

Cheers, Phil
David Woodhouse Feb. 2, 2015, 3:30 p.m. UTC | #25
On Mon, 2015-02-02 at 09:24 +0100, Steffen Klassert wrote:
> 
> Maybe you want to use a virtual tunnel interface (vti) what we have
> already. Everything that is routed through such an interface is
> guaranteed to be either encrypted if a matching xfrm state is present
> or dropped. Same on the rceive side, everything that is received by
> this interface is guaranteed to be IPsec processed. So you can do
> a routing based decision about the IPsec processing.
> 
> While I'm sure it could handle the ESP in UDP encapsulation, I'm not that
> sure about your TCP fallback because this requires a valid xfrm state
> to allow packets to pass. Using the same interface for both is probably
> not possible.

I'm trying to imagine how we could make it work in practice if we end up
exposing two *different* interfaces and having to change the kernel's
routing according to whether we have UDP connectivity at any given
moment in time.

Given how painful it already is to maintain vpnc-script and make it do
the right thing for split-include and split-exclude routing, I'm not
really sure I want to go there.

Even if we could get such a scheme to work, it would probably also
require retaining root privileges to make the changes — and one of the
security benefits over the proprietary VPN clients is that we don't
*need* to run as root. We can either drop privs after running
vpnc-script to do the initial routing setup, or in the NetworkManager
case we *never* run with elevated privileges; we just pass the
IP/routing information back over DBus to NetworkManager.

It occurs to me that for the approach I was thinking about, I wouldn't
even need to touch the internals of the tun driver. It could be a
separate driver which just uses tun_get_socket(). Userspace could hand
it the file descriptors of the tun device and the connected UDP socket,
along with the encryption parameters — and then just stop reading
packets from the tun device for itself.
David Woodhouse Feb. 2, 2015, 3:47 p.m. UTC | #26
On Mon, 2015-02-02 at 16:23 +0100, Phil Sutter wrote:
> Since you want to provide connectivity over HTTPS which is not possible
> in kernel space, you are stuck with keeping the tun device. So the
> packet flow in that case is identical to how e.g. OpenVPN does it:
> 
> - tunX holds default route
> - OpenConnect then:
>   - receives packets on /dev/tun
>   - holds TCP socket to VPN concentrator
>   - does encapsulation into TLS
> 
> Speaking of optimisation, the interesting part is the alternative flow
> via IPsec in UDP.

Right. The packet flow you describe is what we already have. Except of
course we already *do* establish the UDP connection (which is DTLS when
we're talking to a Cisco AnyConnect server, and ESP in UDP when we're
talking to Juniper). If we get responses to keepalive packets, we'll
send outbound packets over the UDP connection. If the UDP connectivity
goes AWOL, we'll fall back to sending on TCP. Rekeying of the UDP
connection is handled over the TCP control connection too. Even in the
DTLS case, the master secret and session ID are exchanged over TCP and
the DTLS is actually done as a 'session resume', without the normal DTLS
handshake ever happening.

As you say, I'm stuck with keeping the tun device (or something very
much like it). This *isn't* like vpnc where I can set up an IPSec config
and just let it run.

>  AFAICT, it should be possible to setup an ESP in UDP
> tunnel using XFRM (see ip-xfrm(8) for reference), although I didn't try
> that myself. The funny thing with XFRM is, it applies before the routing
> decision does: If my IPsec policy matches, the packet goes that way no
> matter what the routing table says about the original destination. This
> can be used to override the default route provided via tun0 in the above
> case.

Except it isn't even the default route. We get given a bunch of split
includes or split excludes from the VPN server. We pass them to
vpnc-script or NetworkManager to actually set the routes up, and those
tools may make their own tweaks to what the server requested — denying
the default route and setting up explicit routes, or adding firewall
rules or NAT to incoming/outgoing packets on the tun device.

If it is no longer *just* the single tun device, everything gets really
complicated. Even *before* we talk about changing it on the fly during
normal operation.

> Of course, OpenConnect has to manage all the XFRM/policy stuff on it's
> own, since switching from ESP in UDP back to TLS would mean to tear down
> the XFRM tunnel. OpenConnect would have to setup (a limited) XFRM and
> send test traffic to decide whether to set it up fully (if limited) or
> tear it down (if unlimited) again so traffic arrives at tunX again.

Right. And ideally without CAP_NET_ADMIN.

> In my opinion, this might work. The whole setup is probably about as
> intuitive as the fact that kernel IPsec tunnel mode does not naturally
> provide an own interface. Firewall setup on top of that might become a
> matter of try-and-error. Maybe having a VTI interface and merely moving
> the default route instead of fiddling with policies all the time might
> make things a little easier to comprehend, but surely adds some
> performance overhead.

I think even the latter is sufficiently complex to manage that it's not
worth pursuing. I may throw together my suggested hack using
tun_get_socket() and see how much it makes *me* barf before deciding
whether to show it here for more feedback :)
David Miller Feb. 4, 2015, 12:19 a.m. UTC | #27
From: David Woodhouse <dwmw2@infradead.org>
Date: Mon, 02 Feb 2015 07:27:10 +0000

> I'm guessing you don't want to push the *whole* management of the TLS
> control connection *and* the UDP transport, and probing the latter with
> keepalives, into the kernel? I certainly don't :)

Whilst Herbert Xu and I have discussed in the past supporting
automatic SSL handling of socket data during socket writes in the
kernel, doing TLS stuff would be a bit of a stretch :-)
David Woodhouse Feb. 4, 2015, 6:35 a.m. UTC | #28
On Tue, 2015-02-03 at 16:19 -0800, David Miller wrote:
> From: David Woodhouse <dwmw2@infradead.org>
> Date: Mon, 02 Feb 2015 07:27:10 +0000
> 
> > I'm guessing you don't want to push the *whole* management of the TLS
> > control connection *and* the UDP transport, and probing the latter with
> > keepalives, into the kernel? I certainly don't :)
> 
> Whilst Herbert Xu and I have discussed in the past supporting
> automatic SSL handling of socket data during socket writes in the
> kernel, doing TLS stuff would be a bit of a stretch :-)

Right. For the DTLS I was thinking we'd do the handshake in userspace
and then hand the UDP socket down. At that point it's basically the same
as ESP with the bytes in a slightly different place.

So I really am looking at an option for "here's a UDP socket to send
those tun packets out on, with <this> encryption setup" as the sanest
plan I can come up with.
diff mbox

Patch

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 96c39bd..4326520 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -387,6 +387,10 @@  static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 	}
 
+	/* Orphan the skb - required as we might hang on to it
+	 * for indefinite time. */
+	skb_orphan(skb);
+
 	/* Enqueue packet */
 	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
 	dev->trans_start = jiffies;