diff mbox

[RFC] net: remove erroneous sk null assignment in timestamping

Message ID 1318007501.3988.20.camel@jlt3.sipsolutions.net
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Johannes Berg Oct. 7, 2011, 5:11 p.m. UTC
From: Johannes Berg <johannes.berg@intel.com>

skb->sk is obviously required to be non-NULL
when we get into skb_complete_tx_timestamp().
sock_queue_err_skb() will call skb_orphan()
first thing which sets skb->sk = NULL itself.
This may crash if the skb is still charged to
the socket (skb->destructor is sk_wfree).

The assignment here thus seems to not only be
pointless (due to the skb_orphan() call) but
also dangerous (due to the crash).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
---
 net/core/timestamping.c |    1 -
 1 file changed, 1 deletion(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Oct. 7, 2011, 5:33 p.m. UTC | #1
From: Johannes Berg <johannes@sipsolutions.net>
Date: Fri, 07 Oct 2011 19:11:41 +0200

> From: Johannes Berg <johannes.berg@intel.com>
> 
> skb->sk is obviously required to be non-NULL
> when we get into skb_complete_tx_timestamp().
> sock_queue_err_skb() will call skb_orphan()
> first thing which sets skb->sk = NULL itself.
> This may crash if the skb is still charged to
> the socket (skb->destructor is sk_wfree).
> 
> The assignment here thus seems to not only be
> pointless (due to the skb_orphan() call) but
> also dangerous (due to the crash).
> 
> Signed-off-by: Johannes Berg <johannes.berg@intel.com>

It looks like skb_clone_tx_timestamp() sets clone->sk without any
proper refcounting, so I bet this NULL'ing it out is working
around that bug.

The TX side of this infrastructure seems very poorly tested.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Oct. 7, 2011, 5:40 p.m. UTC | #2
On Fri, 2011-10-07 at 13:33 -0400, David Miller wrote:
> From: Johannes Berg <johannes@sipsolutions.net>
> Date: Fri, 07 Oct 2011 19:11:41 +0200
> 
> > From: Johannes Berg <johannes.berg@intel.com>
> > 
> > skb->sk is obviously required to be non-NULL
> > when we get into skb_complete_tx_timestamp().
> > sock_queue_err_skb() will call skb_orphan()
> > first thing which sets skb->sk = NULL itself.
> > This may crash if the skb is still charged to
> > the socket (skb->destructor is sk_wfree).
> > 
> > The assignment here thus seems to not only be
> > pointless (due to the skb_orphan() call) but
> > also dangerous (due to the crash).
> > 
> > Signed-off-by: Johannes Berg <johannes.berg@intel.com>
> 
> It looks like skb_clone_tx_timestamp() sets clone->sk without any
> proper refcounting, so I bet this NULL'ing it out is working
> around that bug.

You're right. But this bug can actually trigger another way: The only
user of this is dp83640_txtstamp() which might do kfree_skb() on this
skb, in which case that'll likely crash trying to clean up the sk.

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Oct. 7, 2011, 5:47 p.m. UTC | #3
On Fri, 2011-10-07 at 19:40 +0200, Johannes Berg wrote:

> > It looks like skb_clone_tx_timestamp() sets clone->sk without any
> > proper refcounting, so I bet this NULL'ing it out is working
> > around that bug.
> 
> You're right. But this bug can actually trigger another way: The only
> user of this is dp83640_txtstamp() which might do kfree_skb() on this
> skb, in which case that'll likely crash trying to clean up the sk.

Not only that -- I don't even see what causes the socket to not go away
while we're waiting for the ts_work that dp83640_txtstamp() fires off to
run.

And actually, no, this is all wrong. It shouldn't kick off ts_work
anyway, ts_work only handles *RX* timestamps, never *TX*. The *TX*
timestamps are handled by decode_txts() which is called when we receive
a TXTS frame from the device... *that* dequeues the skb from tx_queue
there.

I'm afraid as is this is terminally broken. I don't have a device with a
dp83640 PHY, but I bet you can cause kernel crashes by doing something
like

while (1)
	fd = open()
	enable tx timestamping on fd;
	send(fd, frame)
	close(fd);

I don't even think any of this is privileged.

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Oct. 7, 2011, 5:53 p.m. UTC | #4
On Fri, 2011-10-07 at 19:47 +0200, Johannes Berg wrote:

> I'm afraid as is this is terminally broken. I don't have a device with a
> dp83640 PHY, but I bet you can cause kernel crashes by doing something
> like
> 
> while (1)
> 	fd = open()
> 	enable tx timestamping on fd;
> 	send(fd, frame)
> 	close(fd);

It's possible that it doesn't crash *if* (and only if!) the ethernet
driver is guaranteed to process the TXTS RX packet before freeing the
original SKB off the TX queue, and also never orphans or whatever the
original TX SKB. In that case the original TX SKB will hang on to the
socket for long enough I guess.

But that's not something I'd want to rely on. All it means that the
above code isn't an instant kernel crash. The code is still broken.

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Oct. 7, 2011, 6:42 p.m. UTC | #5
On Fri, 2011-10-07 at 19:40 +0200, Johannes Berg wrote:

> > It looks like skb_clone_tx_timestamp() sets clone->sk without any
> > proper refcounting, so I bet this NULL'ing it out is working
> > around that bug.
> 
> You're right. But this bug can actually trigger another way: The only
> user of this is dp83640_txtstamp() which might do kfree_skb() on this
> skb, in which case that'll likely crash trying to clean up the sk.

Maybe that's how you can trigger it: have one thread turn on and off
timestamping all the time, and another thread send frames all the time,
then eventually you'll probably run into the kfree_skb() case there. If
you ever manage to run into that case, it'll crash either when freeing
this skb or when freeing the original.

Anyway, it's broken. I'll stop thinking about it. You (Richard) should
fix it quickly though otherwise I think we should revert/delete this
code.

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Cochran Oct. 8, 2011, 7:57 a.m. UTC | #6
On Fri, Oct 07, 2011 at 01:33:56PM -0400, David Miller wrote:
> It looks like skb_clone_tx_timestamp() sets clone->sk without any
> proper refcounting, so I bet this NULL'ing it out is working
> around that bug.

I don't remember why I put it that way, but I took a look at the
problem, and I am not sure how to solve it. The other callers of
sock_queue_err_skb all create or clone the error skb immediately
before queueing it:

  net/core/skbuff.c:       skb_tstamp_tx
  net/ipv4/ip_sockglue.c:  ip_icmp_error, ip_local_error
  net/ipv6/datagram.c:     ipv6_icmp_error, ipv6_local_error

So I need to prevent the socket from disappearing between
skb_clone_tx_timestamp and skb_complete_tx_timestamp:

  skb_clone_tx_timestamp
	clone = skb_clone(skb, GFP_ATOMIC);
	sock_hold
  skb_complete_tx_timestamp
	sock_queue_err_skb(sk, skb);
	sock_put

What do you think?

BTW, while looking for a good pattern to follow, I found that the can
driver also sets skb->sk after clone with no special treatment, like
so:

  drivers/net/can/dev.c:285
	can_put_echo_skb
		struct sock *srcsk = skb->sk;
		skb = skb_clone(old_skb, GFP_ATOMIC);
		skb->sk = srcsk;

> The TX side of this infrastructure seems very poorly tested.

In fact, we do have the phyter driver used in an extensive automated
test farm, but the applications just don't do the kinds of things
suggested to trigger the problem. The normal pattern is, send event
packet, get tx timestamp, and so we haven't seen the bug at all.

Thanks,
Richard
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Cochran Oct. 8, 2011, 7:59 a.m. UTC | #7
On Fri, Oct 07, 2011 at 08:42:12PM +0200, Johannes Berg wrote:
> Maybe that's how you can trigger it: have one thread turn on and off
> timestamping all the time, and another thread send frames all the time,
> then eventually you'll probably run into the kfree_skb() case there. If
> you ever manage to run into that case, it'll crash either when freeing
> this skb or when freeing the original.

Thats one weird app, but I get the point, and thanks for your
attention to my code.

Richard
 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Oct. 8, 2011, 8:16 a.m. UTC | #8
On Sat, 2011-10-08 at 09:57 +0200, Richard Cochran wrote:

> I don't remember why I put it that way, but I took a look at the
> problem, and I am not sure how to solve it. The other callers of
> sock_queue_err_skb all create or clone the error skb immediately
> before queueing it:
> 
>   net/core/skbuff.c:       skb_tstamp_tx
>   net/ipv4/ip_sockglue.c:  ip_icmp_error, ip_local_error
>   net/ipv6/datagram.c:     ipv6_icmp_error, ipv6_local_error

Yeah, I noticed that too. That's also the reason they pass the socket
externally I believe, since it's not a properly refcounted socket (the
reference they use is still from the original skb).

The thing that makes it work is that
 a) they don't release the original SKB before sock_queue_err_skb() and
 b) skb->sk is NULL for them

Since this is just a single function, they can guarantee that -- in the
case we found here it's scattered across the code and won't always be
guaranteed -- e.g. the kfree_skb() case in the PHY driver potentially
violates b).

> So I need to prevent the socket from disappearing between
> skb_clone_tx_timestamp and skb_complete_tx_timestamp:
> 
>   skb_clone_tx_timestamp
> 	clone = skb_clone(skb, GFP_ATOMIC);
> 	sock_hold
>   skb_complete_tx_timestamp
> 	sock_queue_err_skb(sk, skb);
> 	sock_put
> 
> What do you think?

I'm not terribly familiar with struct sock. Looking at it, I'm a bit
confused by skb_orphan() -- it doesn't put the sock reference. So are
sockets not refcounted for skbs in this way? They seem to use
sock_wfree() which does a bit more than this it seems, and I don't see
it using sk_refcnt anywhere so I'm a bit confused now.

> BTW, while looking for a good pattern to follow, I found that the can
> driver also sets skb->sk after clone with no special treatment, like
> so:
> 
>   drivers/net/can/dev.c:285
> 	can_put_echo_skb
> 		struct sock *srcsk = skb->sk;
> 		skb = skb_clone(old_skb, GFP_ATOMIC);
> 		skb->sk = srcsk;

Yeah that looks fishy too. But to me it looks a bit like it should
charge to the socket instead of refcounting it -- though of course
that's not really the correct thing to do from a socket buffer point of
view, but it seems the sk_refcnt and sk_wmem_alloc are two separate
mechanisms of refcounting the socket -- I just haven't figured out yet
how they interact.

> > The TX side of this infrastructure seems very poorly tested.
> 
> In fact, we do have the phyter driver used in an extensive automated
> test farm, but the applications just don't do the kinds of things
> suggested to trigger the problem. The normal pattern is, send event
> packet, get tx timestamp, and so we haven't seen the bug at all.

Makes sense, you never wrote an application trying to crash it :-)

> > Maybe that's how you can trigger it: have one thread turn on and off
> > timestamping all the time, and another thread send frames all the time,
> > then eventually you'll probably run into the kfree_skb() case there. If
> > you ever manage to run into that case, it'll crash either when freeing
> > this skb or when freeing the original.
> 
> Thats one weird app, but I get the point, and thanks for your
> attention to my code.

Agree, it's obviously a specifically devised app to try to make it
crash. It serves no other practical purpose.

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 8, 2011, 8:57 a.m. UTC | #9
Le samedi 08 octobre 2011 à 10:16 +0200, Johannes Berg a écrit :

> I'm not terribly familiar with struct sock. Looking at it, I'm a bit
> confused by skb_orphan() -- it doesn't put the sock reference. So are
> sockets not refcounted for skbs in this way? They seem to use
> sock_wfree() which does a bit more than this it seems, and I don't see
> it using sk_refcnt anywhere so I'm a bit confused now.

Check following commit changelog to get some information on this.

commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Thu Jun 11 02:55:43 2009 -0700

    net: No more expensive sock_hold()/sock_put() on each tx
    
    One of the problem with sock memory accounting is it uses
    a pair of sock_hold()/sock_put() for each transmitted packet.
    
    This slows down bidirectional flows because the receive path
    also needs to take a refcount on socket and might use a different
    cpu than transmit path or transmit completion path. So these
    two atomic operations also trigger cache line bounces.
    
    We can see this in tx or tx/rx workloads (media gateways for example),
    where sock_wfree() can be in top five functions in profiles.
    
    We use this sock_hold()/sock_put() so that sock freeing
    is delayed until all tx packets are completed.
    
    As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
    by one unit at init time, until sk_free() is called.
    Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
    to decrement initial offset and atomicaly check if any packets
    are in flight.
    
    skb_set_owner_w() doesnt call sock_hold() anymore
    
    sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc
    reached 0 to perform the final freeing.
    
    Drawback is that a skb->truesize error could lead to unfreeable sockets, or
    even worse, prematurely calling __sk_free() on a live socket.
    
    Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
    on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
    contention point. 5 % speedup on a UDP transmit workload (depends
    on number of flows), lowering TX completion cpu usage.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Oct. 8, 2011, 10:32 a.m. UTC | #10
On Sat, 2011-10-08 at 10:57 +0200, Eric Dumazet wrote:
> Le samedi 08 octobre 2011 à 10:16 +0200, Johannes Berg a écrit :
> 
> > I'm not terribly familiar with struct sock. Looking at it, I'm a bit
> > confused by skb_orphan() -- it doesn't put the sock reference. So are
> > sockets not refcounted for skbs in this way? They seem to use
> > sock_wfree() which does a bit more than this it seems, and I don't see
> > it using sk_refcnt anywhere so I'm a bit confused now.
> 
> Check following commit changelog to get some information on this.
> 
> commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Thu Jun 11 02:55:43 2009 -0700
> 
>     net: No more expensive sock_hold()/sock_put() on each tx

Aha, thanks for the pointer!

>     As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
>     by one unit at init time, until sk_free() is called.

This is the trick! Neat. I see it now, now sk_free() makes sense to
me :-)

There's one thing I still miss though: It seems to me that if you have a
reference to a socket that has been sk_free()'ed (which is possible
since it might still have sk_wmem_alloc > 0) you can't sock_hold() that
socket. That feels a bit unexpected -- and might happen in the code
Richard just suggested.

Basically, while you can bump a reference you own via sock_hold() by
sk_wmem_alloc, as soon as sk_refcnt reaches 0 it must never go > 0 again
because that will have released the sk_wmem_alloc offset.

That can be fixed, but I'm not exactly sure what would be an efficient
way of doing it. Maybe by adding some flag that says whether or not the
sk_wmem_alloc offset is present, but that flag would have to be atomic
since I guess even this could race.

>     Drawback is that a skb->truesize error could lead to unfreeable sockets, or
>     even worse, prematurely calling __sk_free() on a live socket.

I was thinking about this yesterday as well. What we could do is wrap
all the skb truesize operations in inlines -- the regular ones like
skb_truesize_set()/add()/sub() would (depending on a debug Kconfig)
check that the skb isn't charged to a socket, and common sequences like
changing truesize and updating the sock could be wrapped into another
set of inlines that do both. Or something like that.

I was actually thinking about this for other reasons but then realised
that with the truesize bug check gone (which was really checking
something else) I didn't really need to worry any more.

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Cochran Oct. 8, 2011, 10:35 a.m. UTC | #11
On Sat, Oct 08, 2011 at 10:57:18AM +0200, Eric Dumazet wrote:
> Le samedi 08 octobre 2011 à 10:16 +0200, Johannes Berg a écrit :
> 
> > I'm not terribly familiar with struct sock. Looking at it, I'm a bit
> > confused by skb_orphan() -- it doesn't put the sock reference. So are
> > sockets not refcounted for skbs in this way? They seem to use
> > sock_wfree() which does a bit more than this it seems, and I don't see
> > it using sk_refcnt anywhere so I'm a bit confused now.

Me, too.

> Check following commit changelog to get some information on this.

Thanks, Eric, that does help explain.

>     We use this sock_hold()/sock_put() so that sock freeing
>     is delayed until all tx packets are completed.
>     
>     As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
>     by one unit at init time, until sk_free() is called.
>     Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
>     to decrement initial offset and atomicaly check if any packets
>     are in flight.
>     
>     skb_set_owner_w() doesnt call sock_hold() anymore

So, if I understand, then I can solve my particular problem by doing:

* skb_clone_tx_timestamp
	clone = skb_clone(skb, GFP_ATOMIC);
	skb_set_owner_w(clone, sk)
// instead of
//	clone->sk = sk;
	phydev->drv->txtstamp(phydev, clone, type);

* skb_complete_tx_timestamp
	serr->ee.ee_errno = ENOMSG;
	serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
// just remove this:
//	skb->sk = NULL;
	err = sock_queue_err_skb(sk, skb);

The only problem(?) I see is that it violates the rules from sock.h,
quoted below. The cloned tx skb destined for the error queue would be
budgeted to sk_wmem_alloc while wait for the time stamp. But maybe we
can allow this?

from sock.h:
/*
 * Socket reference counting postulates.
 *
 * * Each user of socket SHOULD hold a reference count.
 * * Each access point to socket (an hash table bucket, reference from a list,
 *   running timer, skb in flight MUST hold a reference count.
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                   /
BTW, no longer true.

 * * Packets, delivered from outside (from network or from another process)
 *   and enqueued on receive/error queues SHOULD NOT grab reference count,
 *   when they sit in queue. ~~~~~~~~~~~~~~~~~~~~~~~

Want to break/bend this rule.

Thanks,
Richard
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Cochran Oct. 11, 2011, 1:34 p.m. UTC | #12
On Sat, Oct 08, 2011 at 12:32:15PM +0200, Johannes Berg wrote:
> On Sat, 2011-10-08 at 10:57 +0200, Eric Dumazet wrote:
> > Check following commit changelog to get some information on this.
> > 
> > commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
> > Author: Eric Dumazet <eric.dumazet@gmail.com>
> > Date:   Thu Jun 11 02:55:43 2009 -0700
> > 
> >     net: No more expensive sock_hold()/sock_put() on each tx
...
> There's one thing I still miss though: It seems to me that if you have a
> reference to a socket that has been sk_free()'ed (which is possible
> since it might still have sk_wmem_alloc > 0) you can't sock_hold() that
> socket. That feels a bit unexpected -- and might happen in the code
> Richard just suggested.

Yes, I have been trying to see how to solve this, but it looks like I
am out of luck. 

Even if I use skb_set_owner_w() in skb_clone_tx_timestamp(), still the
sock might go away during skb_orphan() in sock_queue_err_skb().

It is no good to take sock_hold() in skb_complete_tx_timestamp(),
since, as you point out, it might not be safe to call.

So, I wonder, when is it safe to call sock_hold?

Are the 101 odd callers protected against the situation where the last
sock_out() has already happened?

Thanks,

Richard
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- a/net/core/timestamping.c	2011-10-07 18:59:12.000000000 +0200
+++ b/net/core/timestamping.c	2011-10-07 19:07:06.000000000 +0200
@@ -85,7 +85,6 @@  void skb_complete_tx_timestamp(struct sk
 	memset(serr, 0, sizeof(*serr));
 	serr->ee.ee_errno = ENOMSG;
 	serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
-	skb->sk = NULL;
 	err = sock_queue_err_skb(sk, skb);
 	if (err)
 		kfree_skb(skb);