diff mbox

[1/1] net: hold sock reference while processing tx timestamps

Message ID 56185ca8a7dc0223031ca0f0996302cac1b497eb.1318444117.git.richard.cochran@omicron.at
State Superseded, archived
Delegated to: David Miller
Headers show

Commit Message

Richard Cochran Oct. 12, 2011, 6:36 p.m. UTC
The pair of functions,

 * skb_clone_tx_timestamp()
 * skb_complete_tx_timestamp()

were designed to allow timestamping in PHY devices. The first
function, called during the MAC driver's hard_xmit method, identifies
PTP protocol packets, clones them, and gives them to the PHY device
driver. The PHY driver may hold onto the packet and deliver it at a
later time using the second function, which adds the packet to the
socket's error queue.

As pointed out by Johannes, nothing prevents the socket from
disappearing while the cloned packet is sitting in the PHY driver
awaiting a timestamp. This patch fixes the issue by taking a reference
on the socket for each such packet. In addition, the comments
regarding the usage of these function are expanded to highlight the
rule that PHY drivers must use skb_complete_tx_timestamp() to release
the packet, in order to release the socket reference, too.

These functions first appeared in v2.6.36.

Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Cc: <stable@kernel.org>
---
 include/linux/phy.h     |    2 +-
 include/linux/skbuff.h  |    7 ++++++-
 net/core/timestamping.c |    7 ++++++-
 3 files changed, 13 insertions(+), 3 deletions(-)

Comments

Eric Dumazet Oct. 12, 2011, 7:25 p.m. UTC | #1
Le mercredi 12 octobre 2011 à 20:36 +0200, Richard Cochran a écrit :
> The pair of functions,
> 
>  * skb_clone_tx_timestamp()
>  * skb_complete_tx_timestamp()
> 
> were designed to allow timestamping in PHY devices. The first
> function, called during the MAC driver's hard_xmit method, identifies
> PTP protocol packets, clones them, and gives them to the PHY device
> driver. The PHY driver may hold onto the packet and deliver it at a
> later time using the second function, which adds the packet to the
> socket's error queue.
> 
> As pointed out by Johannes, nothing prevents the socket from
> disappearing while the cloned packet is sitting in the PHY driver
> awaiting a timestamp. This patch fixes the issue by taking a reference
> on the socket for each such packet. In addition, the comments
> regarding the usage of these function are expanded to highlight the
> rule that PHY drivers must use skb_complete_tx_timestamp() to release
> the packet, in order to release the socket reference, too.
> 
> These functions first appeared in v2.6.36.
> 
> Reported-by: Johannes Berg <johannes@sipsolutions.net>
> Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
> Cc: <stable@kernel.org>
> ---

Seems fine to me, thanks !

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Oct. 12, 2011, 7:27 p.m. UTC | #2
On Wed, 2011-10-12 at 20:36 +0200, Richard Cochran wrote:
> The pair of functions,
> 
>  * skb_clone_tx_timestamp()
>  * skb_complete_tx_timestamp()
> 
> were designed to allow timestamping in PHY devices. The first
> function, called during the MAC driver's hard_xmit method, identifies
> PTP protocol packets, clones them, and gives them to the PHY device
> driver. The PHY driver may hold onto the packet and deliver it at a
> later time using the second function, which adds the packet to the
> socket's error queue.
> 
> As pointed out by Johannes, nothing prevents the socket from
> disappearing while the cloned packet is sitting in the PHY driver
> awaiting a timestamp. This patch fixes the issue by taking a reference
> on the socket for each such packet. In addition, the comments
> regarding the usage of these function are expanded to highlight the
> rule that PHY drivers must use skb_complete_tx_timestamp() to release
> the packet, in order to release the socket reference, too.

This still needs a fix to the PHY driver right? It has a case that can
kfree_skb() the skb instead of passing it back to
complete_tx_timestamp().

> -	if (!hwtstamps)
> +	if (!hwtstamps) {
> +		sock_put(sk);
> +		kfree_skb(skb);
>  		return;
> +	}

Is that right w/o skb->sk = NULL?


The other thing I was wondering -- what if we just set truesize to 1 (we
don't have any truesize checks) and account the skb to the socket
normally. Not really a good way either though.

FWIW I just decided to do it the other way around in mac80211 -- keep
the original SKB that's charged to the socket for the error queue, and
use a clone to actually do the TX.

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 12, 2011, 7:52 p.m. UTC | #3
Le mercredi 12 octobre 2011 à 21:27 +0200, Johannes Berg a écrit :
> On Wed, 2011-10-12 at 20:36 +0200, Richard Cochran wrote:
> > The pair of functions,
> > 
> >  * skb_clone_tx_timestamp()
> >  * skb_complete_tx_timestamp()
> > 
> > were designed to allow timestamping in PHY devices. The first
> > function, called during the MAC driver's hard_xmit method, identifies
> > PTP protocol packets, clones them, and gives them to the PHY device
> > driver. The PHY driver may hold onto the packet and deliver it at a
> > later time using the second function, which adds the packet to the
> > socket's error queue.
> > 
> > As pointed out by Johannes, nothing prevents the socket from
> > disappearing while the cloned packet is sitting in the PHY driver
> > awaiting a timestamp. This patch fixes the issue by taking a reference
> > on the socket for each such packet. In addition, the comments
> > regarding the usage of these function are expanded to highlight the
> > rule that PHY drivers must use skb_complete_tx_timestamp() to release
> > the packet, in order to release the socket reference, too.
> 
> This still needs a fix to the PHY driver right? It has a case that can
> kfree_skb() the skb instead of passing it back to
> complete_tx_timestamp().
> 
> > -	if (!hwtstamps)
> > +	if (!hwtstamps) {
> > +		sock_put(sk);
> > +		kfree_skb(skb);
> >  		return;
> > +	}
> 
> Is that right w/o skb->sk = NULL?
> 
> 
> The other thing I was wondering -- what if we just set truesize to 1 (we
> don't have any truesize checks) and account the skb to the socket
> normally. Not really a good way either though.
> 

Changing truesize is not allowed here see below.

> FWIW I just decided to do it the other way around in mac80211 -- keep
> the original SKB that's charged to the socket for the error queue, and
> use a clone to actually do the TX.

Hmm, please take a look at IFF_SKB_TX_SHARING stuff, added in commit
d88733150 (net: add IFF_SKB_TX_SHARED flag to priv_flags)





--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Cochran Oct. 13, 2011, 4:51 a.m. UTC | #4
On Wed, Oct 12, 2011 at 09:27:53PM +0200, Johannes Berg wrote:
> On Wed, 2011-10-12 at 20:36 +0200, Richard Cochran wrote:
> > The pair of functions,
> > 
> >  * skb_clone_tx_timestamp()
> >  * skb_complete_tx_timestamp()
...
> 
> This still needs a fix to the PHY driver right? It has a case that can
> kfree_skb() the skb instead of passing it back to
> complete_tx_timestamp().

Yes, right, will do.

> > -	if (!hwtstamps)
> > +	if (!hwtstamps) {
> > +		sock_put(sk);
> > +		kfree_skb(skb);
> >  		return;
> > +	}
> 
> Is that right w/o skb->sk = NULL?

I think it is okay. In clone/complete, we use skb->sk in a special
way, just to remember the socket. Within kfree_skb, only the
skb->destructor would possibly use skb->sk, and that function pointer
is NULL (after the clone in the Tx path).

> FWIW I just decided to do it the other way around in mac80211 -- keep
> the original SKB that's charged to the socket for the error queue, and
> use a clone to actually do the TX.

In this case, we must clone, I think. The Ethernet MAC driver will
call the clone_tx_timetstamp hook just before freeing the original
skb.

Thanks,
Richard
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Oct. 13, 2011, 8:54 a.m. UTC | #5
On Wed, 2011-10-12 at 21:52 +0200, Eric Dumazet wrote:

> > The other thing I was wondering -- what if we just set truesize to 1 (we
> > don't have any truesize checks) and account the skb to the socket
> > normally. Not really a good way either though.
> > 
> 
> Changing truesize is not allowed here see below.

Oh, good point.

> > FWIW I just decided to do it the other way around in mac80211 -- keep
> > the original SKB that's charged to the socket for the error queue, and
> > use a clone to actually do the TX.
> 
> Hmm, please take a look at IFF_SKB_TX_SHARING stuff, added in commit
> d88733150 (net: add IFF_SKB_TX_SHARED flag to priv_flags)

mac80211 got that flag cleared -- and right now I see no way to change
that. 11ac/ad may require work in this area to be able to do this
better, and I have a number of ideas on how to do that, but right now I
don't see how we can do that.

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Cochran Oct. 13, 2011, 9:46 a.m. UTC | #6
The first patch fixes a bug in the time stamping code introduced in
v2.6.36 of the kernel.

The other two patches depend on the first patch and fix two bugs in a
PTP Hardware Clock driver. This driver was first introduced in Linux
version 3.0.

Richard Cochran (3):
  net: hold sock reference while processing tx timestamps
  dp83640: use proper function to free transmit time stamping packets
  dp83640: free packet queues on remove

 drivers/net/phy/dp83640.c |   11 +++++++++--
 include/linux/phy.h       |    2 +-
 include/linux/skbuff.h    |    7 ++++++-
 net/core/timestamping.c   |    7 ++++++-
 4 files changed, 22 insertions(+), 5 deletions(-)
David Miller Oct. 19, 2011, 4:16 a.m. UTC | #7
From: Richard Cochran <richardcochran@gmail.com>
Date: Thu, 13 Oct 2011 11:46:26 +0200

> The first patch fixes a bug in the time stamping code introduced in
> v2.6.36 of the kernel.
> 
> The other two patches depend on the first patch and fix two bugs in a
> PTP Hardware Clock driver. This driver was first introduced in Linux
> version 3.0.
> 
> Richard Cochran (3):
>   net: hold sock reference while processing tx timestamps
>   dp83640: use proper function to free transmit time stamping packets
>   dp83640: free packet queues on remove

Johannes and Eric, please help review Richard's fixes here.

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Oct. 19, 2011, 5:15 a.m. UTC | #8
On Wed, 2011-10-19 at 00:16 -0400, David Miller wrote:
> From: Richard Cochran <richardcochran@gmail.com>
> Date: Thu, 13 Oct 2011 11:46:26 +0200
> 
> > The first patch fixes a bug in the time stamping code introduced in
> > v2.6.36 of the kernel.
> > 
> > The other two patches depend on the first patch and fix two bugs in a
> > PTP Hardware Clock driver. This driver was first introduced in Linux
> > version 3.0.
> > 
> > Richard Cochran (3):
> >   net: hold sock reference while processing tx timestamps
> >   dp83640: use proper function to free transmit time stamping packets
> >   dp83640: free packet queues on remove
> 
> Johannes and Eric, please help review Richard's fixes here.

The only thing I'm not completely sure about is whether or not it is
permissible to sock_hold() at that point. I'm probably just missing
something, but: if sk_free() was called before hard_start_xmit() which
will call skb_clone_tx_timestamp(), can we really call sock_hold()?

The reason I ask is that sock_wfree() doesn't check sk_refcnt, so if it
is possible for sk_free() to have been called before hard_start_xmit(),
maybe because the packet was stuck on the qdisc for a while, the socket
won't be released (sk_free checks sk_wmem_alloc) but the sk_wfree() when
the original skb is freed will actually free the socket, invalidating
the clone's sk pointer *even though* we called sock_hold() right after
making the clone.

So what guarantees that sk_refcnt is still non-zero when we make the
clone? Alternatively, should sock_wfree() check sk_refcnt?

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Cochran Oct. 19, 2011, 11:50 a.m. UTC | #9
On Wed, Oct 19, 2011 at 07:15:36AM +0200, Johannes Berg wrote:
> The only thing I'm not completely sure about is whether or not it is
> permissible to sock_hold() at that point. I'm probably just missing
> something, but: if sk_free() was called before hard_start_xmit() which
> will call skb_clone_tx_timestamp(), can we really call sock_hold()?
> 
> The reason I ask is that sock_wfree() doesn't check sk_refcnt, so if it
> is possible for sk_free() to have been called before hard_start_xmit(),
> maybe because the packet was stuck on the qdisc for a while, the socket
> won't be released (sk_free checks sk_wmem_alloc) but the sk_wfree() when
> the original skb is freed will actually free the socket, invalidating
> the clone's sk pointer *even though* we called sock_hold() right after
> making the clone.
> 
> So what guarantees that sk_refcnt is still non-zero when we make the
> clone?

In the non-qdisc path, the kernel is in a send() call, so the initial
reference taken in socket() is held.

I really don't know the qdisc code, whether it is somehow holding the
skb->sk indirectly or not.

Eric? David?

> Alternatively, should sock_wfree() check sk_refcnt?

That would presumably spoil the performance enhancement gained in
commit 2b85a34e.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 19, 2011, 12:33 p.m. UTC | #10
Le mercredi 19 octobre 2011 à 13:50 +0200, Richard Cochran a écrit :
> On Wed, Oct 19, 2011 at 07:15:36AM +0200, Johannes Berg wrote:
> > The only thing I'm not completely sure about is whether or not it is
> > permissible to sock_hold() at that point. I'm probably just missing
> > something, but: if sk_free() was called before hard_start_xmit() which
> > will call skb_clone_tx_timestamp(), can we really call sock_hold()?
> > 

This is not possible, or something is really broken.


> > The reason I ask is that sock_wfree() doesn't check sk_refcnt, so if it
> > is possible for sk_free() to have been called before hard_start_xmit(),
> > maybe because the packet was stuck on the qdisc for a while, the socket
> > won't be released (sk_free checks sk_wmem_alloc) but the sk_wfree() when
> > the original skb is freed will actually free the socket, invalidating
> > the clone's sk pointer *even though* we called sock_hold() right after
> > making the clone.
> > 
> > So what guarantees that sk_refcnt is still non-zero when we make the
> > clone?
> 
> In the non-qdisc path, the kernel is in a send() call, so the initial
> reference taken in socket() is held.
> 
> I really don't know the qdisc code, whether it is somehow holding the
> skb->sk indirectly or not.
> 
> Eric? David?

I dont really understand what's the problem, since sk_free() doesnt care
at all about sk_refcnt, but sk_wmem_alloc.

If one skb is in flight, and still linked to a socket, then this socket
cannot disappear, because this skb->truesize was accounted into
sk->sk_wmem_alloc

Of course, this point is valid as long as skb had not been orphaned.

And this is true 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 19, 2011, 12:38 p.m. UTC | #11
Le mercredi 19 octobre 2011 à 13:50 +0200, Richard Cochran a écrit :
> On Wed, Oct 19, 2011 at 07:15:36AM +0200, Johannes Berg wrote:
> > The only thing I'm not completely sure about is whether or not it is
> > permissible to sock_hold() at that point. I'm probably just missing
> > something, but: if sk_free() was called before hard_start_xmit() which
> > will call skb_clone_tx_timestamp(), can we really call sock_hold()?
> > 

This is not possible, or something is really broken. We specifically
dont skb_orphan(skb) if we know tx timestamping is enabled for this skb.

/*
 * Try to orphan skb early, right before transmission by the device.
 * We cannot orphan skb if tx timestamp is requested or the sk-reference
 * is needed on driver level for other reasons, e.g. see net/can/raw.c
 */
static inline void skb_orphan_try(struct sk_buff *skb)
{
        struct sock *sk = skb->sk;

        if (sk && !skb_shinfo(skb)->tx_flags) {
                /* skb_tx_hash() wont be able to get sk.
                 * We copy sk_hash into skb->rxhash
                 */
                if (!skb->rxhash)
                        skb->rxhash = sk->sk_hash;
                skb_orphan(skb);
        }
}


> > The reason I ask is that sock_wfree() doesn't check sk_refcnt, so if it
> > is possible for sk_free() to have been called before hard_start_xmit(),
> > maybe because the packet was stuck on the qdisc for a while, the socket
> > won't be released (sk_free checks sk_wmem_alloc) but the sk_wfree() when
> > the original skb is freed will actually free the socket, invalidating
> > the clone's sk pointer *even though* we called sock_hold() right after
> > making the clone.
> > 
> > So what guarantees that sk_refcnt is still non-zero when we make the
> > clone?
> 
> In the non-qdisc path, the kernel is in a send() call, so the initial
> reference taken in socket() is held.
> 
> I really don't know the qdisc code, whether it is somehow holding the
> skb->sk indirectly or not.
> 
> Eric? David?

I dont really understand what's the concern, since sk_free() doesnt care
at all about sk_refcnt, but sk_wmem_alloc.

void sk_free(struct sock *sk)
{
        /*
         * We subtract one from sk_wmem_alloc and can know if
         * some packets are still in some tx queue.
         * If not null, sock_wfree() will call __sk_free(sk) later
         */
        if (atomic_dec_and_test(&sk->sk_wmem_alloc))
                __sk_free(sk);
}
EXPORT_SYMBOL(sk_free);


If one skb is in flight, and still linked to a socket, then this socket
cannot disappear, because this skb->truesize was accounted into
sk->sk_wmem_alloc

Of course, this point is valid as long as skb had not been orphaned.

sk_refcnt can be 0, if user closed the socket, but socket wont disappear
as long as sk_wmem_alloc is not 0.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Oct. 19, 2011, 12:58 p.m. UTC | #12
On Wed, 2011-10-19 at 14:38 +0200, Eric Dumazet wrote:
> Le mercredi 19 octobre 2011 à 13:50 +0200, Richard Cochran a écrit :
> > On Wed, Oct 19, 2011 at 07:15:36AM +0200, Johannes Berg wrote:
> > > The only thing I'm not completely sure about is whether or not it is
> > > permissible to sock_hold() at that point. I'm probably just missing
> > > something, but: if sk_free() was called before hard_start_xmit() which
> > > will call skb_clone_tx_timestamp(), can we really call sock_hold()?
> > > 
> 
> This is not possible, or something is really broken. We specifically
> dont skb_orphan(skb) if we know tx timestamping is enabled for this skb.

Why can't sk_free() have been called? I'm not thinking of sock_wfree()
which can't have been called -- so the socket surely still exists
because skb->truesize is still accounted to it -- but what says
sk_refcnt hasn't reached 0 yet?

> /*
>  * Try to orphan skb early, right before transmission by the device.
>  * We cannot orphan skb if tx timestamp is requested or the sk-reference
>  * is needed on driver level for other reasons, e.g. see net/can/raw.c
>  */
> static inline void skb_orphan_try(struct sk_buff *skb)
> {
>         struct sock *sk = skb->sk;
> 
>         if (sk && !skb_shinfo(skb)->tx_flags) {
>                 /* skb_tx_hash() wont be able to get sk.
>                  * We copy sk_hash into skb->rxhash
>                  */
>                 if (!skb->rxhash)
>                         skb->rxhash = sk->sk_hash;
>                 skb_orphan(skb);
>         }
> }

Right.

> I dont really understand what's the concern, since sk_free() doesnt care
> at all about sk_refcnt, but sk_wmem_alloc.

Right.

> void sk_free(struct sock *sk)
[snip]

> If one skb is in flight, and still linked to a socket, then this socket
> cannot disappear, because this skb->truesize was accounted into
> sk->sk_wmem_alloc

This is undoubtedly true, I'm not disputing this.

> Of course, this point is valid as long as skb had not been orphaned.
> 
> sk_refcnt can be 0, if user closed the socket, but socket wont disappear
> as long as sk_wmem_alloc is not 0.

Not disputing this either. But you said sk_refcnt can be 0, so why can't
the following happen:

/* skb; skb->sk = sk; skb->destructor = sock_wfree; */

/* skb is on qdisc, some time passes */

sk_free(sk); /* user closed socket,
                sk->sk_refcnt reaches 0,
		sk->sk_wmem_alloc == skb->truesize,
		__sk_free not called, socket still lives,
		but no more +1 in sk_wmem_alloc */

/* some more time passes */

/* ethernet hard_start_xmit calls skb_clone_tx_timestamp() */
skb2 = skb_clone(skb);
skb2->sk = skb->sk;
sock_hold(skb->sk);

/* ethernet TX completion calls skb_free(skb) */
skb_free(skb):
  sock_wfree(skb); /* sk_wmem_alloc reaches 0,
                      __sk_free called DESPITE sk_refcnt > 0 */

/* later, in skb_complete_tx_timestamp() */
sock_put(sk);	/* KABOOM */


I just want to understand why this can't happen :-)

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 19, 2011, 1:21 p.m. UTC | #13
Le mercredi 19 octobre 2011 à 14:58 +0200, Johannes Berg a écrit :
> On Wed, 2011-10-19 at 14:38 +0200, Eric Dumazet wrote:
> > Le mercredi 19 octobre 2011 à 13:50 +0200, Richard Cochran a écrit :
> > > On Wed, Oct 19, 2011 at 07:15:36AM +0200, Johannes Berg wrote:
> > > > The only thing I'm not completely sure about is whether or not it is
> > > > permissible to sock_hold() at that point. I'm probably just missing
> > > > something, but: if sk_free() was called before hard_start_xmit() which
> > > > will call skb_clone_tx_timestamp(), can we really call sock_hold()?
> > > > 
> > 
> > This is not possible, or something is really broken. We specifically
> > dont skb_orphan(skb) if we know tx timestamping is enabled for this skb.
> 
> Why can't sk_free() have been called? I'm not thinking of sock_wfree()
> which can't have been called -- so the socket surely still exists
> because skb->truesize is still accounted to it -- but what says
> sk_refcnt hasn't reached 0 yet?
> 
> > /*
> >  * Try to orphan skb early, right before transmission by the device.
> >  * We cannot orphan skb if tx timestamp is requested or the sk-reference
> >  * is needed on driver level for other reasons, e.g. see net/can/raw.c
> >  */
> > static inline void skb_orphan_try(struct sk_buff *skb)
> > {
> >         struct sock *sk = skb->sk;
> > 
> >         if (sk && !skb_shinfo(skb)->tx_flags) {
> >                 /* skb_tx_hash() wont be able to get sk.
> >                  * We copy sk_hash into skb->rxhash
> >                  */
> >                 if (!skb->rxhash)
> >                         skb->rxhash = sk->sk_hash;
> >                 skb_orphan(skb);
> >         }
> > }
> 
> Right.
> 
> > I dont really understand what's the concern, since sk_free() doesnt care
> > at all about sk_refcnt, but sk_wmem_alloc.
> 
> Right.
> 
> > void sk_free(struct sock *sk)
> [snip]
> 
> > If one skb is in flight, and still linked to a socket, then this socket
> > cannot disappear, because this skb->truesize was accounted into
> > sk->sk_wmem_alloc
> 
> This is undoubtedly true, I'm not disputing this.
> 
> > Of course, this point is valid as long as skb had not been orphaned.
> > 
> > sk_refcnt can be 0, if user closed the socket, but socket wont disappear
> > as long as sk_wmem_alloc is not 0.
> 
> Not disputing this either. But you said sk_refcnt can be 0, so why can't
> the following happen:
> 
> /* skb; skb->sk = sk; skb->destructor = sock_wfree; */
> 
> /* skb is on qdisc, some time passes */
> 
> sk_free(sk); /* user closed socket,
>                 sk->sk_refcnt reaches 0,
> 		sk->sk_wmem_alloc == skb->truesize,
> 		__sk_free not called, socket still lives,
> 		but no more +1 in sk_wmem_alloc */
> 
> /* some more time passes */
> 
> /* ethernet hard_start_xmit calls skb_clone_tx_timestamp() */
> skb2 = skb_clone(skb);
> skb2->sk = skb->sk;
> sock_hold(skb->sk);
> 
> /* ethernet TX completion calls skb_free(skb) */
> skb_free(skb):
>   sock_wfree(skb); /* sk_wmem_alloc reaches 0,
>                       __sk_free called DESPITE sk_refcnt > 0 */
> 
> /* later, in skb_complete_tx_timestamp() */
> sock_put(sk);	/* KABOOM */
> 
> 
> I just want to understand why this can't happen :-)

Since you answer your own question :)

Hmm, oh well, sk_refcnt is/should not be changed if a xmit packet is
duped, but sk_wmem_alloc should be, exactly paired with skb->truesize



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Oct. 19, 2011, 1:25 p.m. UTC | #14
On Wed, 2011-10-19 at 15:21 +0200, Eric Dumazet wrote:

> > Not disputing this either. But you said sk_refcnt can be 0, so why can't
> > the following happen:
> > 
> > /* skb; skb->sk = sk; skb->destructor = sock_wfree; */
> > 
> > /* skb is on qdisc, some time passes */
> > 
> > sk_free(sk); /* user closed socket,
> >                 sk->sk_refcnt reaches 0,
> > 		sk->sk_wmem_alloc == skb->truesize,
> > 		__sk_free not called, socket still lives,
> > 		but no more +1 in sk_wmem_alloc */
> > 
> > /* some more time passes */
> > 
> > /* ethernet hard_start_xmit calls skb_clone_tx_timestamp() */
> > skb2 = skb_clone(skb);
> > skb2->sk = skb->sk;
> > sock_hold(skb->sk);
> > 
> > /* ethernet TX completion calls skb_free(skb) */
> > skb_free(skb):
> >   sock_wfree(skb); /* sk_wmem_alloc reaches 0,
> >                       __sk_free called DESPITE sk_refcnt > 0 */
> > 
> > /* later, in skb_complete_tx_timestamp() */
> > sock_put(sk);	/* KABOOM */
> > 
> > 
> > I just want to understand why this can't happen :-)
> 
> Since you answer your own question :)

Did I?

> Hmm, oh well, sk_refcnt is/should not be changed if a xmit packet is
> duped, but sk_wmem_alloc should be, exactly paired with skb->truesize

Well, yeah, ok, but Richard's patches do modify sk_refcnt, and can't
modify sk_wmem_alloc as discussed upthread.

I'll admit that I don't really trust the whole thing anyway -- it seems
rather error prone to forbid you from doing sock_hold() on a socket you
obtained from a TX packet.

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 19, 2011, 1:27 p.m. UTC | #15
Le mercredi 19 octobre 2011 à 15:25 +0200, Johannes Berg a écrit :

> Well, yeah, ok, but Richard's patches do modify sk_refcnt, and can't
> modify sk_wmem_alloc as discussed upthread.
> 
> I'll admit that I don't really trust the whole thing anyway -- it seems
> rather error prone to forbid you from doing sock_hold() on a socket you
> obtained from a TX packet.
> 

I probably missed why sk_wmem_alloc can not be modified ?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Oct. 19, 2011, 1:32 p.m. UTC | #16
On Wed, 2011-10-19 at 15:27 +0200, Eric Dumazet wrote:
> Le mercredi 19 octobre 2011 à 15:25 +0200, Johannes Berg a écrit :
> 
> > Well, yeah, ok, but Richard's patches do modify sk_refcnt, and can't
> > modify sk_wmem_alloc as discussed upthread.
> > 
> > I'll admit that I don't really trust the whole thing anyway -- it seems
> > rather error prone to forbid you from doing sock_hold() on a socket you
> > obtained from a TX packet.
> > 
> 
> I probably missed why sk_wmem_alloc can not be modified ?

I think I misremember -- I was suggesting to set skb2->truesize to 1 and
then charge it so it's not double-charged.

OTOH, could skb_clone_tx_timestamp() orphan the original skb after
adding the skb2?

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Cochran Oct. 19, 2011, 2:25 p.m. UTC | #17
On Wed, Oct 19, 2011 at 03:32:53PM +0200, Johannes Berg wrote:
> 
> OTOH, could skb_clone_tx_timestamp() orphan the original skb after
> adding the skb2?

The Ethernet MAC driver decides what to do with the original skb. In
the MAC driver, it goes like:

1. skb_tx_timestamp(skb) -> skb_clone_tx_timestamp(skb);
2. give skb to hardware;
3. ...
4. kfree_skb, recycle, etc

Richard

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Cochran Oct. 21, 2011, 10:49 a.m. UTC | #18
[ Changes in v2: use atomic_inc_not_zero per Eric's suggestion. ]

The first patch fixes a bug in the time stamping code introduced in
v2.6.36 of the kernel.

The other two patches depend on the first patch and fix two bugs in a
PTP Hardware Clock driver. This driver was first introduced in Linux
version 3.0.

Richard Cochran (3):
  net: hold sock reference while processing tx timestamps
  dp83640: use proper function to free transmit time stamping packets
  dp83640: free packet queues on remove

 drivers/net/phy/dp83640.c |   11 +++++++++--
 include/linux/phy.h       |    2 +-
 include/linux/skbuff.h    |    7 ++++++-
 net/core/timestamping.c   |   12 ++++++++++--
 4 files changed, 26 insertions(+), 6 deletions(-)
diff mbox

Patch

diff --git a/include/linux/phy.h b/include/linux/phy.h
index 54fc413..79f337c 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -420,7 +420,7 @@  struct phy_driver {
 
 	/*
 	 * Requests a Tx timestamp for 'skb'. The phy driver promises
-	 * to deliver it to the socket's error queue as soon as a
+	 * to deliver it using skb_complete_tx_timestamp() as soon as a
 	 * timestamp becomes available. One of the PTP_CLASS_ values
 	 * is passed in 'type'.
 	 */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8bd383c..0f96646 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2020,8 +2020,13 @@  static inline bool skb_defer_rx_timestamp(struct sk_buff *skb)
 /**
  * skb_complete_tx_timestamp() - deliver cloned skb with tx timestamps
  *
+ * PHY drivers may accept clones of transmitted packets for
+ * timestamping via their phy_driver.txtstamp method. These drivers
+ * must call this function to return the skb back to the stack, with
+ * or without a timestamp.
+ *
  * @skb: clone of the the original outgoing packet
- * @hwtstamps: hardware time stamps
+ * @hwtstamps: hardware time stamps, may be NULL if not available
  *
  */
 void skb_complete_tx_timestamp(struct sk_buff *skb,
diff --git a/net/core/timestamping.c b/net/core/timestamping.c
index 98a5264..29e59d6 100644
--- a/net/core/timestamping.c
+++ b/net/core/timestamping.c
@@ -60,6 +60,7 @@  void skb_clone_tx_timestamp(struct sk_buff *skb)
 			clone = skb_clone(skb, GFP_ATOMIC);
 			if (!clone)
 				return;
+			sock_hold(sk);
 			clone->sk = sk;
 			phydev->drv->txtstamp(phydev, clone, type);
 		}
@@ -77,8 +78,11 @@  void skb_complete_tx_timestamp(struct sk_buff *skb,
 	struct sock_exterr_skb *serr;
 	int err;
 
-	if (!hwtstamps)
+	if (!hwtstamps) {
+		sock_put(sk);
+		kfree_skb(skb);
 		return;
+	}
 
 	*skb_hwtstamps(skb) = *hwtstamps;
 	serr = SKB_EXT_ERR(skb);
@@ -87,6 +91,7 @@  void skb_complete_tx_timestamp(struct sk_buff *skb,
 	serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
 	skb->sk = NULL;
 	err = sock_queue_err_skb(sk, skb);
+	sock_put(sk);
 	if (err)
 		kfree_skb(skb);
 }