diff mbox

[net-next,2/2] mac80211: Resolve sk_refcnt/sk_wmem_alloc issue in wifi ack path

Message ID 20140910220536.23225.92956.stgit@ahduyck-bv4.jf.intel.com
State Accepted, archived
Delegated to: David Miller

Commit Message

Duyck, Alexander H Sept. 10, 2014, 10:05 p.m. UTC
There is a possible issue with the use, or lack thereof, of sk_refcnt and
sk_wmem_alloc in the wifi ack status functionality.

Specifically, if a socket were to request acknowledgements, and the socket
were to have sk_refcnt drop to 0, leaving it waiting on sk_wmem_alloc
to reach 0, it would be possible to have sock_queue_err_skb orphan the last
buffer, resulting in __sk_free being called on the socket.  After this the
buffer is enqueued on sk_error_queue; however, the queue has already been
flushed, resulting in at least a memory leak, if not data corruption.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/core/skbuff.c |    5 +++++
 net/mac80211/tx.c |   15 ++++-----------
 2 files changed, 9 insertions(+), 11 deletions(-)
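
The ordering problem described in the commit message can be illustrated
with a small user-space model. This is only a sketch under stated
assumptions: struct model_sock and every model_* function are hypothetical
stand-ins, not kernel symbols, and a single counter stands in for whatever
keeps the socket alive. It shows why taking a reference before queueing,
and dropping it afterwards, lets the final free purge the queued buffer
instead of stranding it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a socket; not the kernel type. */
struct model_sock {
	int refs;          /* whatever currently keeps the socket alive */
	int err_queue_len; /* buffers sitting on the modelled error queue */
	int purged;        /* buffers released by the final free */
	int stranded;      /* buffers queued after the socket was freed */
	bool freed;        /* set once the modelled final free has run */
};

/* Models the final free: purge the error queue, then mark the socket gone. */
static void model_free(struct model_sock *sk)
{
	sk->purged += sk->err_queue_len;
	sk->err_queue_len = 0;
	sk->freed = true;
}

static void model_hold(struct model_sock *sk)
{
	assert(!sk->freed);
	sk->refs++;
}

static void model_put(struct model_sock *sk)
{
	assert(sk->refs > 0);
	if (--sk->refs == 0)
		model_free(sk);
}

/* Models queueing onto the error queue: orphan first, enqueue second. */
static void model_queue_err(struct model_sock *sk)
{
	model_put(sk);               /* orphaning drops the buffer's reference */
	if (sk->freed)
		sk->stranded++;      /* queue was already purged: buffer is lost */
	else
		sk->err_queue_len++;
}

/* Ordering before the patch: nothing pins the socket across the call. */
static void model_complete_unpatched(struct model_sock *sk)
{
	model_queue_err(sk);
}

/* Ordering after the patch: hold, queue, then put. */
static void model_complete_patched(struct model_sock *sk)
{
	model_hold(sk);
	model_queue_err(sk);
	model_put(sk);               /* may be the last reference; purge sees the buffer */
}
```

Starting from a socket whose only remaining reference belongs to the
buffer (refs = 1), the unpatched ordering leaves one stranded buffer,
while the patched ordering frees the socket with the buffer purged and
nothing stranded.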


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Johannes Berg Sept. 11, 2014, 7:06 a.m. UTC | #1
On Wed, 2014-09-10 at 18:05 -0400, Alexander Duyck wrote:
> There is a possible issue with the use, or lack thereof of sk_refcnt and
> sk_wmem_alloc in the wifi ack status functionality.
> 
> Specifically if a socket were to request acknowledgements, and the socket
> were to have sk_refcnt drop to 0 resulting in it waiting on sk_wmem_alloc
> to reach 0 it would be possible to have sock_queue_err_skb orphan the last
> buffer, resulting in __sk_free being called on the socket.  After this the
> buffer is enqueued on sk_error_queue, however the queue has already been
> flushed resulting in at least a memory leak, if not a data corruption.

Oh. Thanks :-)

> +	/* take a reference to prevent skb_orphan() from freeing the socket */
> +	sock_hold(sk);
> +
>  	err = sock_queue_err_skb(sk, skb);
>  	if (err)
>  		kfree_skb(skb);
> +
> +	sock_put(sk);
>  }
>  EXPORT_SYMBOL_GPL(skb_complete_wifi_ack);

Here I'm not sure it matters *for this function*? Wouldn't it be freed
then in sock_put(), which has the same net effect on this function
overall? It doesn't use it after sock_queue_err_skb().

Seems like maybe this should be in sock_queue_err_skb() itself, since it
does the orphaning first and then looks at the socket. Or the
documentation for that function should state that it has to be held, but
there are plenty of callers?

>  			spin_lock_irqsave(&local->ack_status_lock, flags);
> -			id = idr_alloc(&local->ack_status_frames, orig_skb,
> +			id = idr_alloc(&local->ack_status_frames, ack_skb,
>  				       1, 0x10000, GFP_ATOMIC);
>  			spin_unlock_irqrestore(&local->ack_status_lock, flags);
>  
>  			if (id >= 0) {
>  				info_id = id;
>  				info_flags |= IEEE80211_TX_CTL_REQ_TX_STATUS;
> -			} else if (skb_shared(skb)) {
> -				kfree_skb(orig_skb);
>  			} else {
> -				kfree_skb(skb);
> -				skb = orig_skb;
> +				kfree_skb(ack_skb);
>  			}

So you're removing this part, but can't we really not reuse the clone_sk
copy? The difference is that it's charged, but that's fine for the
purposes here, no? Or am I misunderstanding that?

johannes

Arend van Spriel Sept. 11, 2014, 9:38 a.m. UTC | #2
On 09/11/14 09:06, Johannes Berg wrote:
> On Wed, 2014-09-10 at 18:05 -0400, Alexander Duyck wrote:
>> There is a possible issue with the use, or lack thereof of sk_refcnt and
>> sk_wmem_alloc in the wifi ack status functionality.
>>
>> Specifically if a socket were to request acknowledgements, and the socket
>> were to have sk_refcnt drop to 0 resulting in it waiting on sk_wmem_alloc
>> to reach 0 it would be possible to have sock_queue_err_skb orphan the last
>> buffer, resulting in __sk_free being called on the socket.  After this the
>> buffer is enqueued on sk_error_queue, however the queue has already been
>> flushed resulting in at least a memory leak, if not a data corruption.
>
> Oh. Thanks :-)

Hi Alexander,

So why is this only an issue in the wifi ack path? Nothing in
sock_queue_err_skb() mentions that the caller should hold a sock
reference. This seems to be entirely an issue of the sock_queue_err_skb()
function itself, so why not do sock_hold/sock_put within that function?
Does it impose too much overhead?

Regards,
Arend
Duyck, Alexander H Sept. 11, 2014, 2:40 p.m. UTC | #3
On 09/11/2014 02:38 AM, Arend van Spriel wrote:
> On 09/11/14 09:06, Johannes Berg wrote:
>> On Wed, 2014-09-10 at 18:05 -0400, Alexander Duyck wrote:
>>> There is a possible issue with the use, or lack thereof of sk_refcnt and
>>> sk_wmem_alloc in the wifi ack status functionality.
>>>
>>> Specifically if a socket were to request acknowledgements, and the
>>> socket
>>> were to have sk_refcnt drop to 0 resulting in it waiting on
>>> sk_wmem_alloc
>>> to reach 0 it would be possible to have sock_queue_err_skb orphan the
>>> last
>>> buffer, resulting in __sk_free being called on the socket.  After
>>> this the
>>> buffer is enqueued on sk_error_queue, however the queue has already been
>>> flushed resulting in at least a memory leak, if not a data corruption.
>>
>> Oh. Thanks :-)
> 
> Hi Alexander,
> 
> So why is this only an issue in wifi ack path. The sock_queue_err_skb()
> does not mention the caller should hold a sock reference. This seems
> entirely an issue of the sock_queue_err_skb() function itself so why not
> do sk_hold/sk_put within that function. Does it impose too much overhead?
> 
> Regards,
> Arend

I considered it but there are a number of cases where this is not an issue.

For example in the tx timestamping path there is the software timestamp
case where the buffer is cloned and the clone is queued immediately onto
the sk_error_queue.  In that case we still have a reference in the other
skb that is maintaining the socket.

So I thought it best to just address the cases where I know this could
be a problem.  I had already addressed it in the timestamping for
hardware timestamps where we are doing something similar.  So I thought
it would make sense to cover the other case that should have the same
problems.

Thanks,

Alex
Duyck, Alexander H Sept. 11, 2014, 3:21 p.m. UTC | #4
On 09/11/2014 12:06 AM, Johannes Berg wrote:
> On Wed, 2014-09-10 at 18:05 -0400, Alexander Duyck wrote:
>> There is a possible issue with the use, or lack thereof of sk_refcnt and
>> sk_wmem_alloc in the wifi ack status functionality.
>>
>> Specifically if a socket were to request acknowledgements, and the socket
>> were to have sk_refcnt drop to 0 resulting in it waiting on sk_wmem_alloc
>> to reach 0 it would be possible to have sock_queue_err_skb orphan the last
>> buffer, resulting in __sk_free being called on the socket.  After this the
>> buffer is enqueued on sk_error_queue, however the queue has already been
>> flushed resulting in at least a memory leak, if not a data corruption.
> 
> Oh. Thanks :-)
> 
>> +	/* take a reference to prevent skb_orphan() from freeing the socket */
>> +	sock_hold(sk);
>> +
>>  	err = sock_queue_err_skb(sk, skb);
>>  	if (err)
>>  		kfree_skb(skb);
>> +
>> +	sock_put(sk);
>>  }
>>  EXPORT_SYMBOL_GPL(skb_complete_wifi_ack);
> 
> Here I'm not sure it matters *for this function*? Wouldn't it be freed
> then in sock_put(), which has the same net effect on this function
> overall? It doesn't use it after sock_queue_err_skb().

The significant piece is that we are calling sock_put *after*.  So if we
are dropping the last reference the buffer is already in the
sk_error_queue and will be purged when __sk_free is called.

> Seems like maybe this should be in sock_queue_err_skb() itself, since it
> does the orphaning first and then looks at the socket. Or the
> documentation for that function should state that it has to be held, but
> there are plenty of callers?

The problem is there are a number of cases where the sock_hold/put are
not needed.  For example, if we were to clone the skb and immediately
send the clone up the sk_error_queue then we don't need it.  We only
need it if there is a risk that orphaning the buffer sent could
potentially result in the destructor calling __sk_free.

>>  			spin_lock_irqsave(&local->ack_status_lock, flags);
>> -			id = idr_alloc(&local->ack_status_frames, orig_skb,
>> +			id = idr_alloc(&local->ack_status_frames, ack_skb,
>>  				       1, 0x10000, GFP_ATOMIC);
>>  			spin_unlock_irqrestore(&local->ack_status_lock, flags);
>>  
>>  			if (id >= 0) {
>>  				info_id = id;
>>  				info_flags |= IEEE80211_TX_CTL_REQ_TX_STATUS;
>> -			} else if (skb_shared(skb)) {
>> -				kfree_skb(orig_skb);
>>  			} else {
>> -				kfree_skb(skb);
>> -				skb = orig_skb;
>> +				kfree_skb(ack_skb);
>>  			}
> 
> So you're removing this part, but can't we really not reuse the clone_sk
> copy? The difference is that it's charged, but that's fine for the
> purposes here, no? Or am I misunderstanding that?
> 
> johannes

The copy being held cannot really be used for transmit.  The problem is
that it is holding the wrong kind of reference.

The problem lies in the order things are released.  The sock_put
function will dec_and_test sk_refcnt; once it reaches 0 it will do a
dec_and_test on sk_wmem_alloc to see if it should call __sk_free.  Until
sk_refcnt reaches 0, sk_wmem_alloc cannot reach 0.  Once either of these
drops to 0 we cannot bring the value back up from there.  So if I were to
transmit the clone then it could let the sk_refcnt drop to 0, in which
case any calls to sock_hold are invalid.

I would need to somehow hold the reference based on sk_wmem_alloc if we
want to transmit the clone.  Many of the hardware timestamping drivers
seem to just clone the original skb, queue that clone onto the
sk_error_queue, and then free the original after completing the call.  I
suppose we could change it to something like that, but you are still
looking at possibly 2 clones in that case anyway.

Thanks,

Alex
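
The two-stage release described in the reply above can be sketched as a
user-space model. It is an illustration only: struct model_sock and the
model_* helpers are hypothetical names, and the model assumes that
sk_wmem_alloc carries one extra "bias" unit that is given up when
sk_refcnt reaches 0, which is one way to read the description above. The
hold helper deliberately refuses to raise a count that has already
reached 0, mirroring the statement that such a hold would be invalid.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these are kernel symbols. */
struct model_sock {
	int refcnt;      /* models sk_refcnt */
	int wmem_alloc;  /* models sk_wmem_alloc */
	bool freed;      /* set when the modelled __sk_free() would run */
};

struct model_result {
	bool freed_after_close;
	bool hold_after_close_ok;
	bool freed_after_tx_done;
};

/* Assumption: one bias unit in wmem_alloc stands for "refcnt is non-zero". */
static void model_init(struct model_sock *sk)
{
	sk->refcnt = 1;
	sk->wmem_alloc = 1;
	sk->freed = false;
}

/* Drop write-memory units; the socket is freed only when the count hits 0. */
static void model_wmem_release(struct model_sock *sk, int units)
{
	assert(sk->wmem_alloc >= units);
	sk->wmem_alloc -= units;
	if (sk->wmem_alloc == 0)
		sk->freed = true;
}

/* Models sock_put(): the last reference gives up the bias unit. */
static void model_sock_put(struct model_sock *sk)
{
	assert(sk->refcnt > 0);
	if (--sk->refcnt == 0)
		model_wmem_release(sk, 1);
}

/* A hold that refuses to revive a count that has already reached 0. */
static bool model_sock_hold(struct model_sock *sk)
{
	if (sk->refcnt == 0)
		return false;
	sk->refcnt++;
	return true;
}

/* Owner drops its reference while one transmit buffer is still charged. */
static struct model_result model_close_with_tx_in_flight(void)
{
	struct model_sock sk;
	struct model_result r;

	model_init(&sk);
	sk.wmem_alloc += 1;            /* one transmit buffer charged to the socket */

	model_sock_put(&sk);           /* refcnt 1 -> 0, bias unit dropped */
	r.freed_after_close = sk.freed;

	r.hold_after_close_ok = model_sock_hold(&sk);

	model_wmem_release(&sk, 1);    /* buffer completes: wmem_alloc 1 -> 0 */
	r.freed_after_tx_done = sk.freed;
	return r;
}
```

In this model the socket survives the owner's put because a buffer is
still charged to it, a later hold fails, and the socket is freed only
when the buffer's charge is released. That is the window in which a
transmitted buffer, rather than a reference count, is the only thing
keeping the socket alive.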
Johannes Berg Sept. 11, 2014, 3:53 p.m. UTC | #5
On Thu, 2014-09-11 at 08:21 -0700, Alexander Duyck wrote:

[...]
> >>  EXPORT_SYMBOL_GPL(skb_complete_wifi_ack);
> > 
> > Here I'm not sure it matters *for this function*? Wouldn't it be freed
> > then in sock_put(), which has the same net effect on this function
> > overall? It doesn't use it after sock_queue_err_skb().
> 
> The significant piece is that we are calling sock_put *after*.  So if we
> are dropping the last reference the buffer is already in the
> sk_error_queue and will be purged when __sk_free is called.

Yeah, I understand. But that's more of a problem of sock_queue_err_skb()
rather than this function.

> > Seems like maybe this should be in sock_queue_err_skb() itself, since it
> > does the orphaning first and then looks at the socket. Or the
> > documentation for that function should state that it has to be held, but
> > there are plenty of callers?
> 
> The problem is there are a number of cases where the sock_hold/put are
> not needed.  For example, if we were to clone the skb and immediately
> send the clone up the sk_error_queue then we don't need it.  We only
> need it if there is a risk that orphaning the buffer sent could
> potentially result in the destructor calling __sk_free.

Ok, that's reasonable. Maybe then you can add that to the documentation
of sock_queue_err_skb() - that it must (somehow) ensure the socket can't
go away while it's being called? That way this caller change would
become clearer IMHO.

> > So you're removing this part, but can't we really not reuse the clone_sk
> > copy? The difference is that it's charged, but that's fine for the
> > purposes here, no? Or am I misunderstanding that?

> The copy being held cannot really be used for transmit.  The problem is
> that it is holding the wrong kind of reference.

Ok.

> The problem lies in the order things are released.  The sock_put
> function will dec_and_test sk_refcnt, once it reaches 0 it will do a
> dec_and_test on sk_wmem_alloc to see if it should call __sk_free.  Until
> that reaches 0 sk_wmem_alloc cannot reach 0.  Once either of these drops
> to 0 we cannot bring the value back up from there.  So if I were to
> transmit the clone then it could let the sk_refcnt drop to 0 in which
> case any calls to sock_hold are invalid.
> 
> I would need to somehow hold the reference based on sk_wmem_alloc if we
> want to transmit the clone.  Many of the hardware timestamping drivers
> seem to just clone the original skb, queue that clone onto the
> sk_error_queue, and then free the original after completing the call.  I
> suppose we could change it to something like that, but you are still
> looking at possibly 2 clones in that case anyway.

Well, no need. I just had originally wanted to reuse the clone so under
these corner case conditions we didn't clone twice - no big deal, it
never happens anyway (that IDR thing should never actually run out of
space).

Anyway, thanks. I expect that, because of patch 1, davem will apply both
patches (and I'm going to be on vacation anyway), so

Acked-by: Johannes Berg <johannes@sipsolutions.net>

for both patches.

Thanks!

johannes


Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c9da77a..c8259ac 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3628,9 +3628,14 @@  void skb_complete_wifi_ack(struct sk_buff *skb, bool acked)
 	serr->ee.ee_errno = ENOMSG;
 	serr->ee.ee_origin = SO_EE_ORIGIN_TXSTATUS;
 
+	/* take a reference to prevent skb_orphan() from freeing the socket */
+	sock_hold(sk);
+
 	err = sock_queue_err_skb(sk, skb);
 	if (err)
 		kfree_skb(skb);
+
+	sock_put(sk);
 }
 EXPORT_SYMBOL_GPL(skb_complete_wifi_ack);
 
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 925c39f..cf71414 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -2072,30 +2072,23 @@  netdev_tx_t ieee80211_subif_start_xmit(struct sk_buff *skb,
 
 	if (unlikely(!multicast && skb->sk &&
 		     skb_shinfo(skb)->tx_flags & SKBTX_WIFI_STATUS)) {
-		struct sk_buff *orig_skb = skb;
+		struct sk_buff *ack_skb = skb_clone_sk(skb);
 
-		skb = skb_clone(skb, GFP_ATOMIC);
-		if (skb) {
+		if (ack_skb) {
 			unsigned long flags;
 			int id;
 
 			spin_lock_irqsave(&local->ack_status_lock, flags);
-			id = idr_alloc(&local->ack_status_frames, orig_skb,
+			id = idr_alloc(&local->ack_status_frames, ack_skb,
 				       1, 0x10000, GFP_ATOMIC);
 			spin_unlock_irqrestore(&local->ack_status_lock, flags);
 
 			if (id >= 0) {
 				info_id = id;
 				info_flags |= IEEE80211_TX_CTL_REQ_TX_STATUS;
-			} else if (skb_shared(skb)) {
-				kfree_skb(orig_skb);
 			} else {
-				kfree_skb(skb);
-				skb = orig_skb;
+				kfree_skb(ack_skb);
 			}
-		} else {
-			/* couldn't clone -- lose tx status ... */
-			skb = orig_skb;
 		}
 	}
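
The control flow that the mac80211 hunk arrives at can be modelled in user
space as follows. This is a sketch, not the driver code: the model_* names
and the tiny ID range are hypothetical (the patch allocates IDs in the
range 1 to 0x10000 with idr_alloc() under ack_status_lock, and locking is
omitted here). It shows that status tracking is best effort: if no ID is
available, the separate ack copy is dropped and the original frame is
still transmitted, just without a status request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ID_MIN 1
#define MODEL_ID_END 4  /* tiny on purpose so exhaustion is easy to reach */

/* Hypothetical stand-in for the table of frames awaiting ack status. */
struct model_ack_table {
	const void *slot[MODEL_ID_END];
};

/* Outcome of preparing one frame for transmit. */
struct model_tx {
	int info_id;         /* 0 means "no status ID attached" */
	bool want_status;    /* models setting the request-status flag */
	bool ack_copy_freed; /* models freeing the ack copy on failure */
};

/* Returns an ID in [MODEL_ID_MIN, MODEL_ID_END), or -1 when the table is full. */
static int model_id_alloc(struct model_ack_table *t, const void *frame)
{
	for (int id = MODEL_ID_MIN; id < MODEL_ID_END; id++) {
		if (t->slot[id] == NULL) {
			t->slot[id] = frame;
			return id;
		}
	}
	return -1;
}

/* The frame itself is always sent; only the ack copy depends on the ID. */
static struct model_tx model_start_xmit(struct model_ack_table *t,
					const void *ack_copy)
{
	struct model_tx tx = { 0, false, false };
	int id;

	assert(ack_copy != NULL);
	id = model_id_alloc(t, ack_copy);
	if (id >= 0) {
		tx.info_id = id;
		tx.want_status = true;
	} else {
		tx.ack_copy_freed = true;
	}
	return tx;
}
```

With a zero-initialised table, the first three frames in this model
receive IDs 1, 2 and 3; a fourth frame gets no ID, so its ack copy is
marked freed and no status is requested for it.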