
Multicast packet loss

Message ID 49A8FAFF.7060104@cosmosbay.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Feb. 28, 2009, 8:51 a.m. UTC
Kenny Chang a écrit :
> It's been a while since I updated this thread.  We've been running
> through the different suggestions and tabulating their effects, as well
> as trying out an Intel card.  The short story is that setting affinity
> and MSI works to some extent, and the Intel card doesn't seem to change
> things significantly.  The results don't seem consistent enough for us
> to be able to point to a smoking gun.
> 
> It does look like the 2.6.29-rc4 kernel performs okay with the Intel
> card, but this is not a real-time build and it's not likely to be in a
> supported Ubuntu distribution real soon.  We've reached the point where
> we'd like to look for an expert dedicated to work on this problem for a
> period of time.  The final result being some sort of solution to produce
> a realtime configuration with a reasonably "aged" kernel (.24~.28) that
> has multicast performance greater than or equal to that of 2.6.15.
> 
> If anybody is interested in devoting some compensated time to this
> issue, we're offering up a bounty:
> http://www.athenacr.com/bounties/multicast-performance/
> 
> For completeness, here's the table of our experiment results:
> 
> ====================  ===============  ====  ========  ============  ============  ============  ============
> Kernel                flavor           IRQ   affinity  4x mcasttest  5x mcasttest  6x mcasttest  Mtools2 [4]_
> ====================  ===============  ====  ========  ============  ============  ============  ============
> Intel e1000e
> -------------------------------------------------------------------------------------------------------------
> 2.6.24.19             rt                     any       OK            Maybe         X
> 2.6.24.19             rt                     CPU0      OK            OK            X
> 2.6.24.19             generic                any       X
> 2.6.24.19             generic                CPU0      OK
> 2.6.29-rc3            vanilla-server         any       X
> 2.6.29-rc3            vanilla-server         CPU0      OK
> 2.6.29-rc4            vanilla-generic        any       X                                         OK
> 2.6.29-rc4            vanilla-generic        CPU0      OK            OK            OK [5]_       OK
> -------------------------------------------------------------------------------------------------------------
> Broadcom BNX2
> -------------------------------------------------------------------------------------------------------------
> 2.6.24-19             rt               MSI   any       OK            OK            X
> 2.6.24-19             rt               MSI   CPU0      OK            Maybe         X
> 2.6.24-19             rt               APIC  any       OK            OK            X
> 2.6.24-19             rt               APIC  CPU0      OK            Maybe         X
> 2.6.24-19-bnx-latest  rt               APIC  CPU0      OK            X
> 2.6.24-19             server           MSI   any       X
> 2.6.24-19             server           MSI   CPU0      OK
> 2.6.24-19             generic          APIC  any       X
> 2.6.24-19             generic          APIC  CPU0      OK
> 2.6.27-11             generic          APIC  any       X
> 2.6.27-11             generic          APIC  CPU0      OK            10% drop
> 2.6.28-8              generic          APIC  any       OK            X
> 2.6.28-8              generic          APIC  CPU0      OK            OK            0.5% drop
> 2.6.29-rc3            vanilla-server   MSI   any       X
> 2.6.29-rc3            vanilla-server   MSI   CPU0      X
> 2.6.29-rc3            vanilla-server   APIC  any       OK            X
> 2.6.29-rc3            vanilla-server   APIC  CPU0      OK            OK
> 2.6.29-rc4            vanilla-generic  APIC  any       X
> 2.6.29-rc4            vanilla-generic  APIC  CPU0      OK            3% drop       10% drop      X
> ====================  ===============  ====  ========  ============  ============  ============  ============
> 
> * [4] MTools2 is a test from 29West: http://www.29west.com/docs/TestNet/
> * [5] In 5 trials, 1 of the trials dropped 2%, 4 of the trials dropped
> nothing.
> 
> Kenny
> 

Hi Kenny

I am investigating how to reduce contention (and schedule() calls) on this workload.

The following patch already gives me fewer packet drops, though not yet *perfect*:
10% packet loss instead of 30%, with 8 receivers on my 8-cpu machine.


David, this is a preliminary work, not meant for inclusion as is,
comments are welcome.

Thank you

[PATCH] net: sk_forward_alloc becomes an atomic_t

Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
(UDP: Add memory accounting) introduced a regression for high-rate UDP flows,
because of the extra lock_sock() in udp_recvmsg().

In order to reduce the need for lock_sock() in the UDP receive path, we might
declare sk_forward_alloc as an atomic_t.

udp_recvmsg() can avoid a lock_sock()/release_sock() pair.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/sock.h   |   14 +++++++-------
 net/core/sock.c      |   31 +++++++++++++++++++------------
 net/core/stream.c    |    2 +-
 net/ipv4/af_inet.c   |    2 +-
 net/ipv4/inet_diag.c |    2 +-
 net/ipv4/tcp_input.c |    2 +-
 net/ipv4/udp.c       |    2 --
 net/ipv6/udp.c       |    2 --
 net/sched/em_meta.c  |    2 +-
 9 files changed, 31 insertions(+), 28 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet March 1, 2009, 5:03 p.m. UTC | #1
Eric Dumazet a écrit :
> Kenny Chang a écrit :
>> [Kenny's results table snipped]
> 
> Hi Kenny
> 
> I am investigating how to reduce contention (and schedule() calls) on this workload.
> 

I bound the NIC (gigabit BNX2) irq to cpu 0, so that oprofile results on this cpu can show us
where ksoftirqd is spending its time.

We can see the scheduler at work :)

Also, one thing to note is __copy_skb_header(): 9.49% of cpu0 time.
The problem comes from dst_clone() (6.05% in total, so 2/3 of __copy_skb_header()),
which touches a highly contended cache line (the other cpus are doing the decrement of
the dst refcounter).

CPU: Core 2, speed 3000.05 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) 
with a unit mask of 0x00 (Unhalted core cycles) count 100000
Samples on CPU 0
(samples for other cpus 1..7 omitted)
samples  cum. samples  %        cum. %     symbol name
23750    23750          9.8159   9.8159    try_to_wake_up
22972    46722          9.4944  19.3103    __copy_skb_header
20217    66939          8.3557  27.6660    enqueue_task_fair
14565    81504          6.0197  33.6857    sock_def_readable
13454    94958          5.5606  39.2463    task_rq_lock
13381    108339         5.5304  44.7767    resched_task
13090    121429         5.4101  50.1868    udp_queue_rcv_skb
11441    132870         4.7286  54.9154    skb_queue_tail
10109    142979         4.1781  59.0935    sock_queue_rcv_skb
10024    153003         4.1429  63.2364    __wake_up_sync
9952     162955         4.1132  67.3496    update_curr
8761     171716         3.6209  70.9705    sched_clock_cpu
7414     179130         3.0642  74.0347    rb_insert_color
7381     186511         3.0506  77.0853    select_task_rq_fair
6749     193260         2.7894  79.8747    __slab_alloc
5881     199141         2.4306  82.3053    __wake_up_common
5432     204573         2.2451  84.5504    __skb_clone
4306     208879         1.7797  86.3300    kmem_cache_alloc
3524     212403         1.4565  87.7865    place_entity
2783     215186         1.1502  88.9367    skb_clone
2576     217762         1.0647  90.0014    __udp4_lib_rcv
2430     220192         1.0043  91.0057    bnx2_poll_work
2184     222376         0.9027  91.9084    ipt_do_table
2090     224466         0.8638  92.7722    ip_route_input
1877     226343         0.7758  93.5479    __alloc_skb
1495     227838         0.6179  94.1658    native_sched_clock
1166     229004         0.4819  94.6477    __update_sched_clock
1083     230087         0.4476  95.0953    netif_receive_skb
1062     231149         0.4389  95.5343    activate_task
644      231793         0.2662  95.8004    __kmalloc_track_caller
638      232431         0.2637  96.0641    nf_iterate
549      232980         0.2269  96.2910    skb_put

David Miller March 4, 2009, 8:16 a.m. UTC | #2
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 28 Feb 2009 09:51:11 +0100

> David, this is a preliminary work, not meant for inclusion as is,
> comments are welcome.
> 
> [PATCH] net: sk_forward_alloc becomes an atomic_t
> 
> Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
> (UDP: Add memory accounting) introduced a regression for high rate UDP flows,
> because of extra lock_sock() in udp_recvmsg()
> 
> In order to reduce need for lock_sock() in UDP receive path, we might need
> to declare sk_forward_alloc as an atomic_t.
> 
> udp_recvmsg() can avoid a lock_sock()/release_sock() pair.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

This adds new overhead for TCP which has to hold the socket
lock for other reasons in these paths.

I don't get how an atomic_t operation is cheaper than a
lock_sock/release_sock.  Is it the case that in many
executions of these paths only atomic_read()'s are necessary?

I actually think this scheme is racy.  There is a reason we
have to hold the socket lock when doing memory scheduling.
Two threads can get in there and say "hey I have enough space
already" even though only enough space is allocated for one
of their requests.

What did I miss? :)

Eric Dumazet March 4, 2009, 8:36 a.m. UTC | #3
David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sat, 28 Feb 2009 09:51:11 +0100
> 
>> David, this is a preliminary work, not meant for inclusion as is,
>> comments are welcome.
>>
>> [PATCH] net: sk_forward_alloc becomes an atomic_t
>>
>> Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
>> (UDP: Add memory accounting) introduced a regression for high rate UDP flows,
>> because of extra lock_sock() in udp_recvmsg()
>>
>> In order to reduce need for lock_sock() in UDP receive path, we might need
>> to declare sk_forward_alloc as an atomic_t.
>>
>> udp_recvmsg() can avoid a lock_sock()/release_sock() pair.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> This adds new overhead for TCP which has to hold the socket
> lock for other reasons in these paths.
> 
> I don't get how an atomic_t operation is cheaper than a
> lock_sock/release_sock.  Is it the case that in many
> executions of these paths only atomic_read()'s are necessary?
> 
> I actually think this scheme is racy.  There is a reason we
> have to hold the socket lock when doing memory scheduling.
> Two threads can get in there and say "hey I have enough space
> already" even though only enough space is allocated for one
> of their requests.
> 
> What did I miss? :)
> 

I believe you are right, and in fact I was about to post a "don't look at this patch"
followup, since it doesn't help multicast reception at all: I redid the tests more
carefully and got nothing but noise.

We have a cache-line ping-pong mess here, and it needs more thought.

I rewrote Kenny's program to use non-blocking sockets.

Receivers are doing :

        int delay = 50;

        fcntl(s, F_SETFL, O_NDELAY);
        while (1) {
            struct sockaddr_in from;
            socklen_t fromlen = sizeof(from);

            res = recvfrom(s, buf, 1000, 0, (struct sockaddr *)&from, &fromlen);
            if (res == -1) {        /* queue empty: back off a little */
                delay++;
                usleep(delay);
                continue;
            }
            if (delay > 40)         /* packets flowing: poll a bit faster */
                delay--;
            ++npackets;
        }
With this little user-space change and 8 receivers on my dual quad core, ksoftirqd
only takes 8% of one cpu, with no drops at all (instead of 100% cpu and 30% drops).

So this is definitely a problem of mixing scheduler cache-line ping-pongs with network
stack cache-line ping-pongs.

We could reorder fields so that fewer cache lines are touched by the softirq processing;
I tried this but still got packet drops.


Eric Dumazet March 7, 2009, 7:46 a.m. UTC | #4
Eric Dumazet a écrit :
> [quote of the previous two messages snipped]

I have more questions:

What is the maximum latency you can afford on the delivery of the packet(s)?

Are the user apps using real-time scheduling?

I had an idea: keep the cpu handling NIC interrupts delivering packets to
socket queues only, without touching the scheduler: fast queueing, then waking
a workqueue (on another cpu) to perform the scheduler work. But that means
some extra latency (on the order of 2 or 3 us, I guess).

We could enter this mode automatically if the NIC rx handler *sees* more than
N packets waiting in the NIC queue: with moderate or light traffic, no
extra latency would be incurred. This would mean some changes in NIC drivers.

Hmm, then again, if the NIC rx handler is run from ksoftirqd, we already know
we are in a stress situation, so maybe no driver changes are necessary:
just test whether we are running in ksoftirqd...



Patch

diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..c4befb9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -250,7 +250,7 @@  struct sock {
 	struct sk_buff_head	sk_async_wait_queue;
 #endif
 	int			sk_wmem_queued;
-	int			sk_forward_alloc;
+	atomic_t		sk_forward_alloc;
 	gfp_t			sk_allocation;
 	int			sk_route_caps;
 	int			sk_gso_type;
@@ -823,7 +823,7 @@  static inline int sk_wmem_schedule(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= sk->sk_forward_alloc ||
+	return size <= atomic_read(&sk->sk_forward_alloc) ||
 		__sk_mem_schedule(sk, size, SK_MEM_SEND);
 }
 
@@ -831,7 +831,7 @@  static inline int sk_rmem_schedule(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= sk->sk_forward_alloc ||
+	return size <= atomic_read(&sk->sk_forward_alloc) ||
 		__sk_mem_schedule(sk, size, SK_MEM_RECV);
 }
 
@@ -839,7 +839,7 @@  static inline void sk_mem_reclaim(struct sock *sk)
 {
 	if (!sk_has_account(sk))
 		return;
-	if (sk->sk_forward_alloc >= SK_MEM_QUANTUM)
+	if (atomic_read(&sk->sk_forward_alloc) >= SK_MEM_QUANTUM)
 		__sk_mem_reclaim(sk);
 }
 
@@ -847,7 +847,7 @@  static inline void sk_mem_reclaim_partial(struct sock *sk)
 {
 	if (!sk_has_account(sk))
 		return;
-	if (sk->sk_forward_alloc > SK_MEM_QUANTUM)
+	if (atomic_read(&sk->sk_forward_alloc) > SK_MEM_QUANTUM)
 		__sk_mem_reclaim(sk);
 }
 
@@ -855,14 +855,14 @@  static inline void sk_mem_charge(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return;
-	sk->sk_forward_alloc -= size;
+	atomic_sub(size, &sk->sk_forward_alloc);
 }
 
 static inline void sk_mem_uncharge(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return;
-	sk->sk_forward_alloc += size;
+	atomic_add(size, &sk->sk_forward_alloc);
 }
 
 static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
diff --git a/net/core/sock.c b/net/core/sock.c
index 0620046..8489105 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1081,7 +1081,7 @@  struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 
 		newsk->sk_dst_cache	= NULL;
 		newsk->sk_wmem_queued	= 0;
-		newsk->sk_forward_alloc = 0;
+		atomic_set(&newsk->sk_forward_alloc, 0);
 		newsk->sk_send_head	= NULL;
 		newsk->sk_userlocks	= sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
 
@@ -1479,7 +1479,7 @@  int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	int amt = sk_mem_pages(size);
 	int allocated;
 
-	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
+	atomic_add(amt * SK_MEM_QUANTUM, &sk->sk_forward_alloc);
 	allocated = atomic_add_return(amt, prot->memory_allocated);
 
 	/* Under limit. */
@@ -1520,7 +1520,7 @@  int __sk_mem_schedule(struct sock *sk, int size, int kind)
 		if (prot->sysctl_mem[2] > alloc *
 		    sk_mem_pages(sk->sk_wmem_queued +
 				 atomic_read(&sk->sk_rmem_alloc) +
-				 sk->sk_forward_alloc))
+				 atomic_read(&sk->sk_forward_alloc)))
 			return 1;
 	}
 
@@ -1537,7 +1537,7 @@  suppress_allocation:
 	}
 
 	/* Alas. Undo changes. */
-	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
+	atomic_sub(amt * SK_MEM_QUANTUM, &sk->sk_forward_alloc);
 	atomic_sub(amt, prot->memory_allocated);
 	return 0;
 }
@@ -1551,14 +1551,21 @@  EXPORT_SYMBOL(__sk_mem_schedule);
 void __sk_mem_reclaim(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
-
-	atomic_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
-		   prot->memory_allocated);
-	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
-
-	if (prot->memory_pressure && *prot->memory_pressure &&
-	    (atomic_read(prot->memory_allocated) < prot->sysctl_mem[0]))
-		*prot->memory_pressure = 0;
+	int val = atomic_read(&sk->sk_forward_alloc);
+
+begin:
+	val = atomic_read(&sk->sk_forward_alloc);
+	if (val >= SK_MEM_QUANTUM) {
+		if (atomic_cmpxchg(&sk->sk_forward_alloc, val,
+				   val & (SK_MEM_QUANTUM - 1)) != val)
+			goto begin;
+		atomic_sub(val >> SK_MEM_QUANTUM_SHIFT,
+			   prot->memory_allocated);
+
+		if (prot->memory_pressure && *prot->memory_pressure &&
+		    (atomic_read(prot->memory_allocated) < prot->sysctl_mem[0]))
+			*prot->memory_pressure = 0;
+	}
 }
 
 EXPORT_SYMBOL(__sk_mem_reclaim);
diff --git a/net/core/stream.c b/net/core/stream.c
index 8727cea..4d04d28 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -198,7 +198,7 @@  void sk_stream_kill_queues(struct sock *sk)
 	sk_mem_reclaim(sk);
 
 	WARN_ON(sk->sk_wmem_queued);
-	WARN_ON(sk->sk_forward_alloc);
+	WARN_ON(atomic_read(&sk->sk_forward_alloc));
 
 	/* It is _impossible_ for the backlog to contain anything
 	 * when we get here.  All user references to this socket
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 627be4d..7a1475c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -152,7 +152,7 @@  void inet_sock_destruct(struct sock *sk)
 	WARN_ON(atomic_read(&sk->sk_rmem_alloc));
 	WARN_ON(atomic_read(&sk->sk_wmem_alloc));
 	WARN_ON(sk->sk_wmem_queued);
-	WARN_ON(sk->sk_forward_alloc);
+	WARN_ON(atomic_read(&sk->sk_forward_alloc));
 
 	kfree(inet->opt);
 	dst_release(sk->sk_dst_cache);
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 588a779..903ad66 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -158,7 +158,7 @@  static int inet_csk_diag_fill(struct sock *sk,
 	if (minfo) {
 		minfo->idiag_rmem = atomic_read(&sk->sk_rmem_alloc);
 		minfo->idiag_wmem = sk->sk_wmem_queued;
-		minfo->idiag_fmem = sk->sk_forward_alloc;
+		minfo->idiag_fmem = atomic_read(&sk->sk_forward_alloc);
 		minfo->idiag_tmem = atomic_read(&sk->sk_wmem_alloc);
 	}
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a6961d7..5e08f37 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5258,7 +5258,7 @@  int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 
 				tcp_rcv_rtt_measure_ts(sk, skb);
 
-				if ((int)skb->truesize > sk->sk_forward_alloc)
+				if ((int)skb->truesize > atomic_read(&sk->sk_forward_alloc))
 					goto step5;
 
 				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 4bd178a..dcc246a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -955,9 +955,7 @@  try_again:
 		err = ulen;
 
 out_free:
-	lock_sock(sk);
 	skb_free_datagram(sk, skb);
-	release_sock(sk);
 out:
 	return err;
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 84b1a29..582b80a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -257,9 +257,7 @@  try_again:
 		err = ulen;
 
 out_free:
-	lock_sock(sk);
 	skb_free_datagram(sk, skb);
-	release_sock(sk);
 out:
 	return err;
 
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index 72cf86e..94d90b6 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -383,7 +383,7 @@  META_COLLECTOR(int_sk_wmem_queued)
 META_COLLECTOR(int_sk_fwd_alloc)
 {
 	SKIP_NONLOCAL(skb);
-	dst->value = skb->sk->sk_forward_alloc;
+	dst->value = atomic_read(&skb->sk->sk_forward_alloc);
 }
 
 META_COLLECTOR(int_sk_sndbuf)