
memcg/tcp: fix warning caused by res->usage going negative.

Message ID 4F7F1091.9040204@parallels.com
State RFC, archived
Delegated to: David Miller

Commit Message

Glauber Costa April 6, 2012, 3:49 p.m. UTC
On 03/30/2012 05:44 AM, KAMEZAWA Hiroyuki wrote:
> Maybe what we can do before the lsf/mm summit will be this (avoid the warning).
> This patch is against Linus's git tree. The patch description is updated.
> 
> Thanks.
> -Kame
> ==
>  From 4ab80f84bbcb02a790342426c1de84aeb17fcbe9 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 29 Mar 2012 14:59:04 +0900
> Subject: [PATCH] memcg/tcp: fix warning caused by res->usage going negative.
> 
> tcp memcontrol starts accounting after res->limit is set. So, if a socket
> starts before res->limit is set, there is already resource in use.
> Once res->limit is set, accounting starts. That resource will be uncharged
> and will drive the res_counter below 0, because it was never charged.
> This causes a warning.
> 

Kame,

Please test the following patch and see if it fixes your problems. (I
tested locally, and it triggers no warnings for me when running the test
script you provided plus an inbound scp -r copy of an iso directory from
a remote machine.)

When you are reviewing, keep in mind that we're likely to have the same
problem with the slab jump labels - since the slab pages will outlive the
cgroup as well - so it might be worthwhile to provide a central point
where the jump labels are turned off on cgroup destruction.

Comments

KAMEZAWA Hiroyuki April 10, 2012, 2:37 a.m. UTC | #1
(2012/04/07 0:49), Glauber Costa wrote:

> On 03/30/2012 05:44 AM, KAMEZAWA Hiroyuki wrote:
>> Maybe what we can do before the lsf/mm summit will be this (avoid the warning).
>> This patch is against Linus's git tree. The patch description is updated.
>>
>> Thanks.
>> -Kame
>> ==
>>  From 4ab80f84bbcb02a790342426c1de84aeb17fcbe9 Mon Sep 17 00:00:00 2001
>> From: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
>> Date: Thu, 29 Mar 2012 14:59:04 +0900
>> Subject: [PATCH] memcg/tcp: fix warning caused by res->usage going negative.
>>
>> tcp memcontrol starts accounting after res->limit is set. So, if a socket
>> starts before res->limit is set, there is already resource in use.
>> Once res->limit is set, accounting starts. That resource will be uncharged
>> and will drive the res_counter below 0, because it was never charged.
>> This causes a warning.
>>
> 
> Kame,
> 
> Please test the following patch and see if it fixes your problems. (I
> tested locally, and it triggers no warnings for me when running the test
> script you provided plus an inbound scp -r copy of an iso directory from
> a remote machine.)
> 
> When you are reviewing, keep in mind that we're likely to have the same
> problem with the slab jump labels - since the slab pages will outlive the
> cgroup as well - so it might be worthwhile to provide a central point
> where the jump labels are turned off on cgroup destruction.
> 


Hm. What happens in the following sequence?

  1. a memcg is created
  2. put a task into the memcg, start a tcp stream
  3. set the tcp memory limit

The resource used between 2 and 3 will eventually cause the problem.

Then, Dave's request
==
You must either:

1) Integrate the socket's existing usage when the limit is set.

2) Avoid accounting completely for a socket that started before
   the limit was set.
==
are not satisfied. So, we need per-socket state saying whether the socket
is accounted or not. I'll look into this problem again today.
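
Roughly the kind of per-socket state I have in mind (just a sketch; the
"accounted" flag and the helpers are hypothetical, they exist in no tree):
==
/* Sketch only: "accounted" is a hypothetical per-socket flag. */
struct cg_proto;			/* existing per-memcg protocol state */

struct sk_memcg_state {
	struct cg_proto	*sk_cgrp;	/* NULL if no limit was set at socket creation */
	bool		accounted;	/* did this socket ever charge the res_counter? */
};

static void sk_memcg_charge(struct sk_memcg_state *st, unsigned long bytes)
{
	if (!st->sk_cgrp)
		return;
	st->accounted = true;
	/* ... charge "bytes" to the cgroup's res_counter here ... */
}

static void sk_memcg_uncharge(struct sk_memcg_state *st, unsigned long bytes)
{
	/* never uncharge what was never charged -> usage cannot go negative */
	if (!st->sk_cgrp || !st->accounted)
		return;
	/* ... uncharge "bytes" from the cgroup's res_counter here ... */
}
==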


Thanks,
-Kame
Glauber Costa April 10, 2012, 2:51 a.m. UTC | #2
On 04/09/2012 11:37 PM, KAMEZAWA Hiroyuki wrote:
> Hm. What happens in the following sequence?
> 
>    1. a memcg is created
>    2. put a task into the memcg, start a tcp stream
>    3. set the tcp memory limit
> 
> The resource used between 2 and 3 will eventually cause the problem.

I don't get it. If a task is in a memcg but no limit is set,
that socket will be assigned a NULL memcg, and will stay like that
forever. Only new sockets will get the new memcg pointer.

And previously, we could have the memcg pointer alive but the jump
labels disabled. With the patch I posted, this can't happen
anymore, since the jump labels are guaranteed to stay enabled
throughout the whole socket lifetime.
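
For reference, the assignment only ever happens under the static key -
roughly like this (a simplified sketch of sock_update_memcg() from
mm/memcontrol.c, with the socket-cloning case and error handling trimmed):

void sock_update_memcg(struct sock *sk)
{
	/* patched out until at least one cgroup ever sets a tcp limit */
	if (static_key_false(&memcg_socket_limit_enabled)) {
		struct mem_cgroup *memcg;

		rcu_read_lock();
		memcg = mem_cgroup_from_task(current);
		if (!mem_cgroup_is_root(memcg)) {
			mem_cgroup_get(memcg);
			sk->sk_cgrp = sk->sk_prot->proto_cgroup(memcg);
		}
		rcu_read_unlock();
	}
	/*
	 * Otherwise sk->sk_cgrp stays NULL for the whole socket lifetime,
	 * so such a socket is never charged and never uncharged.
	 */
}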

> Then, Dave's request
> ==
> You must either:
> 
> 1) Integrate the socket's existing usage when the limit is set.
> 
> 2) Avoid accounting completely for a socket that started before
>     the limit was set.
> ==
> are not satisfied. So, we need per-socket state saying whether the socket
> is accounted or not. I'll look into this problem again today.
> 

Of course they are.

Every socket created before we set the limit is not accounted.
This is 2) that Dave mentioned, and it was *always* this way.

The problem here was the opposite: you could disable the jump labels
with sockets still in flight, because we were disabling them based on
the limit being set back to unlimited.

What this patch does is defer that until the last limited socket dies.
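
In code terms, the lifetime rule is roughly this (a condensed sketch of
the patch at the bottom of this page, not the literal diff):

static void tcp_limit_set_sketch(struct tcp_memcontrol *tcp)
{
	/* first limit ever set on this cgroup: take one static_key reference */
	if (!tcp->limited) {
		static_key_slow_inc(&memcg_socket_limit_enabled);
		tcp->limited = true;
	}
	/* note: setting the limit back to RESOURCE_MAX does NOT drop it */
}

static void memcg_final_put_sketch(struct mem_cgroup *memcg)
{
	/*
	 * The reference is dropped only when the memcg itself is freed,
	 * i.e. after the last socket pointing at it has died.
	 */
	if (memcg->tcp_mem.limited)
		static_key_slow_dec(&memcg_socket_limit_enabled);
}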



Glauber Costa April 10, 2012, 3:01 a.m. UTC | #3
On 04/09/2012 11:51 PM, Glauber Costa wrote:
> On 04/09/2012 11:37 PM, KAMEZAWA Hiroyuki wrote:
>> Hm. What happens in the following sequence?
>>
>>     1. a memcg is created
>>     2. put a task into the memcg, start a tcp stream
>>     3. set the tcp memory limit
>>
>> The resource used between 2 and 3 will eventually cause the problem.
> 
> I don't get it. If a task is in a memcg but no limit is set,
> that socket will be assigned a NULL memcg, and will stay like that
> forever. Only new sockets will get the new memcg pointer.
> 
> And previously, we could have the memcg pointer alive but the jump
> labels disabled. With the patch I posted, this can't happen
> anymore, since the jump labels are guaranteed to stay enabled
> throughout the whole socket lifetime.
> 
>> Then, Dave's request
>> ==
>> You must either:
>>
>> 1) Integrate the socket's existing usage when the limit is set.
>>
>> 2) Avoid accounting completely for a socket that started before
>>      the limit was set.
>> ==
>> are not satisfied. So, we need per-socket state saying whether the socket
>> is accounted or not. I'll look into this problem again today.
>>
> 
> Of course they are.
> 
> Every socket created before we set the limit is not accounted.
> This is 2) that Dave mentioned, and it was *always* this way.
> 
> The problem here was the opposite: you could disable the jump labels
> with sockets still in flight, because we were disabling them based on
> the limit being set back to unlimited.
> 
> What this patch does is defer that until the last limited socket dies.
> 

Okay, there is an additional thing to be considered here:

Due to the nature of how jump labels work, once they are enabled for one
of the cgroups, they will be enabled for all of them. So the patch I
sent may still break in some scenarios because of the way we record that
the limit was set.

However, if my theory behind what is causing the problem is correct,
this patch should fix the issue for you. Let me know if it does, and
I'll work on the final solution.

KAMEZAWA Hiroyuki April 10, 2012, 3:21 a.m. UTC | #4
(2012/04/10 11:51), Glauber Costa wrote:

> On 04/09/2012 11:37 PM, KAMEZAWA Hiroyuki wrote:
>> Hm. What happens in the following sequence?
>>
>>    1. a memcg is created
>>    2. put a task into the memcg, start a tcp stream
>>    3. set the tcp memory limit
>>
>> The resource used between 2 and 3 will eventually cause the problem.
> 
> I don't get it. If a task is in a memcg but no limit is set,
> that socket will be assigned a NULL memcg, and will stay like that
> forever. Only new sockets will get the new memcg pointer.
> 
> And previously, we could have the memcg pointer alive but the jump
> labels disabled. With the patch I posted, this can't happen
> anymore, since the jump labels are guaranteed to stay enabled
> throughout the whole socket lifetime.
> 
>> Then, Dave's request
>> ==
>> You must either:
>>
>> 1) Integrate the socket's existing usage when the limit is set.
>>
>> 2) Avoid accounting completely for a socket that started before
>>     the limit was set.
>> ==
>> are not satisfied. So, we need per-socket state saying whether the socket
>> is accounted or not. I'll look into this problem again today.
>>
> 
> Of course they are.
> 
> Every socket created before we set the limit is not accounted.
> This is 2) that Dave mentioned, and it was *always* this way.
> 
> The problem here was the opposite: you could disable the jump labels
> with sockets still in flight, because we were disabling them based on
> the limit being set back to unlimited.
> 
> What this patch does is defer that until the last limited socket dies.
> 

Thank you for the explanation. Hmm, the sk->sk_cgrp check?
Ah, yes, it's updated by sock_update_memcg() under the jump label, which is
called from tcp_v4_init_sock().
Hm, and the jump label's atomic counter and mutex will be a guard against
the set/unset race. Ok.

BTW, what will happen in the following case?

Assume that the last memcg is destroyed, which calls jump_label_dec(), and
that thread waits on jump_label_mutex for a while.

      CPU A                                      CPU B

    jump_label_dec() # mutex is held             sock_update_memcg() is called
                                                 sk_cgrp is set.
    ...modify instructions                       some accounting is done.
    mutex_unlock()

I wonder whether you need some serialization somewhere, OR to disallow
turning accounting off.

Thanks,
-Kame

KAMEZAWA Hiroyuki April 10, 2012, 4:15 a.m. UTC | #5
(2012/04/10 12:01), Glauber Costa wrote:

> On 04/09/2012 11:51 PM, Glauber Costa wrote:
>> On 04/09/2012 11:37 PM, KAMEZAWA Hiroyuki wrote:
>>> Hm. What happens in the following sequence?
>>>
>>>     1. a memcg is created
>>>     2. put a task into the memcg, start a tcp stream
>>>     3. set the tcp memory limit
>>>
>>> The resource used between 2 and 3 will eventually cause the problem.
>>
>> I don't get it. If a task is in a memcg but no limit is set,
>> that socket will be assigned a NULL memcg, and will stay like that
>> forever. Only new sockets will get the new memcg pointer.
>>
>> And previously, we could have the memcg pointer alive but the jump
>> labels disabled. With the patch I posted, this can't happen
>> anymore, since the jump labels are guaranteed to stay enabled
>> throughout the whole socket lifetime.
>>
>>> Then, Dave's request
>>> ==
>>> You must either:
>>>
>>> 1) Integrate the socket's existing usage when the limit is set.
>>>
>>> 2) Avoid accounting completely for a socket that started before
>>>      the limit was set.
>>> ==
>>> are not satisfied. So, we need per-socket state saying whether the socket
>>> is accounted or not. I'll look into this problem again today.
>>>
>>
>> Of course they are.
>>
>> Every socket created before we set the limit is not accounted.
>> This is 2) that Dave mentioned, and it was *always* this way.
>>
>> The problem here was the opposite: you could disable the jump labels
>> with sockets still in flight, because we were disabling them based on
>> the limit being set back to unlimited.
>>
>> What this patch does is defer that until the last limited socket dies.
>>
> 
> Okay, there is an additional thing to be considered here:
> 
> Due to the nature of how jump labels work, once they are enabled for one
> of the cgroups, they will be enabled for all of them. So the patch I
> sent may still break in some scenarios because of the way we record that
> the limit was set.
> 
> However, if my theory behind what is causing the problem is correct,
> this patch should fix the issue for you. 


Now, our issue is a leak of accounting, regardless of the warning.

> Let me know if it does, and

> I'll work on the final solution.
> 


The problem is that jump label updates are not atomic operations.

I'm _not_ sure about the update order of the jump label in sock_update_memcg()
and the other jump instructions inserted at the accounting sites.

For example, if the jump instruction in sock_update_memcg() is updated first
and the others are updated later, it's unclear whether sockets which have a
_valid_ sock->sk_cgrp will be accounted or not, because the accounting jump
instructions may not have been updated yet.

Hopefully, the label in sock_update_memcg() should be updated last...

Hm. If I do, I'll add one more key as:

atomic_t	sock_should_memcg_aware;

And update 2 keys in following order.

At enable
	static_key_slow_inc(&memcg_socket_limit_enabled)
	atomic_inc(&sock_should_memcg_aware);

At disable
	atomic_dec(&sock_should_memcg_aware);
	static_key_slow_dec(&memcg_socket_limit_enabled)

And
==
void sock_update_memcg(struct sock *sk)
{
        if (atomic_read(&sock_should_memcg_aware)) {

==
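
Spelled out a bit more (still just a sketch, the helper names are made up):
==
static atomic_t sock_should_memcg_aware = ATOMIC_INIT(0);

static void memcg_tcp_account_enable(void)
{
	/* static_key_slow_inc() returns only after the call sites are patched, */
	static_key_slow_inc(&memcg_socket_limit_enabled);
	/* ...so new sockets may pick up sk_cgrp only after that point */
	atomic_inc(&sock_should_memcg_aware);
}

static void memcg_tcp_account_disable(void)
{
	/* stop attaching sk_cgrp before any accounting site is patched out */
	atomic_dec(&sock_should_memcg_aware);
	static_key_slow_dec(&memcg_socket_limit_enabled);
}

void sock_update_memcg(struct sock *sk)
{
	if (atomic_read(&sock_should_memcg_aware)) {
		/* ... attach sk->sk_cgrp as we do today ... */
	}
}
==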


Thanks,
-Kame
Glauber Costa April 11, 2012, 2:22 a.m. UTC | #6
> 
> The problem is that jump label updates are not atomic operations.

That's indeed really bad.
You seem to be right about that... I wonder, however, if we should
prevent anything from running on any cpu during the jump label update?
The update can be slow, sure, but it is not expected to be frequent...

> I'm _not_ sure about the update order of the jump label in sock_update_memcg()
> and the other jump instructions inserted at the accounting sites.

Any ordering assumption here would be extremely fragile, even if we
could make one.

> For example, if the jump instruction in sock_update_memcg() is updated first
> and the others are updated later, it's unclear whether sockets which have a
> _valid_ sock->sk_cgrp will be accounted or not, because the accounting jump
> instructions may not have been updated yet.
> 
> Hopefully, the label in sock_update_memcg() should be updated last...
> 
> Hm. If I do, I'll add one more key as:
> 
> atomic_t	sock_should_memcg_aware;
> 
> And update 2 keys in following order.
> 
> At enable
> 	static_key_slow_inc(&memcg_socket_limit_enabled)
> 	atomic_inc(&sock_should_memcg_aware);
> 
> At disable
> 	atomic_dec(&sock_should_memcg_aware);
> 	static_key_slow_dec(&memcg_socket_limit_enabled)
> 
> And
> ==
> void sock_update_memcg(struct sock *sk)
> {
>          if (atomic_read(&sock_should_memcg_aware)) {
> 
> ==

The problem is that having an atomic variable here defeats the purpose
of the jump labels a bit.

If it is just in the update path, it is not *that* bad.

But unless we go fix the jump labels to prevent such a thing from
happening, maybe it would be better to have two jump labels here?

One for the writer, which is updated last, and one for the readers.
This way we can force the ordering the way we want.
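
Something like this, perhaps (sketch only, the key names are invented):
one key guards the charge/uncharge sites, the other guards
sock_update_memcg(), and the update order guarantees the accounting
sites are never patched out while a socket can still pick up an sk_cgrp.

struct static_key memcg_tcp_account_key = STATIC_KEY_INIT_FALSE; /* charge/uncharge sites */
struct static_key memcg_tcp_update_key  = STATIC_KEY_INIT_FALSE; /* sock_update_memcg()   */

static void memcg_tcp_keys_inc(void)
{
	static_key_slow_inc(&memcg_tcp_account_key);	/* readers first... */
	static_key_slow_inc(&memcg_tcp_update_key);	/* ...writer last   */
}

static void memcg_tcp_keys_dec(void)
{
	static_key_slow_dec(&memcg_tcp_update_key);	/* writer first...  */
	static_key_slow_dec(&memcg_tcp_account_key);	/* ...readers last  */
}

That keeps the fast paths as plain jump labels and only orders the
slow-path updates.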


Patch

From c40bbd69cbb655b6389c2398ce89abb06e64910d Mon Sep 17 00:00:00 2001
From: Glauber Costa <glommer@parallels.com>
Date: Wed, 4 Apr 2012 21:08:38 +0400
Subject: [PATCH] decrement static keys on real destroy time

We call the destroy function when a cgroup starts to be removed,
such as by a rmdir event.

However, because of our reference counters, some objects are still
in flight. Right now, we are decrementing the static_keys at destroy()
time, meaning that if we get rid of the last static_key reference,
some objects will still have charges, but the code to properly
uncharge them won't be run.

This becomes a problem especially if it is ever enabled again, because
new charges will then be added on top of the stale ones, making the
accounting pretty much impossible to keep consistent.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 include/net/tcp_memcontrol.h |    2 ++
 mm/memcontrol.c              |   15 +++++++++++++++
 net/ipv4/tcp_memcontrol.c    |   10 ++++------
 3 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/net/tcp_memcontrol.h b/include/net/tcp_memcontrol.h
index 7df18bc..5a2b915 100644
--- a/include/net/tcp_memcontrol.h
+++ b/include/net/tcp_memcontrol.h
@@ -9,6 +9,8 @@  struct tcp_memcontrol {
 	/* those two are read-mostly, leave them at the end */
 	long tcp_prot_mem[3];
 	int tcp_memory_pressure;
+	/* if this cgroup was ever limited, having static_keys activated */
+	bool limited;
 };
 
 struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 64a1bcd..74b757b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -442,6 +442,15 @@  void sock_release_memcg(struct sock *sk)
 	}
 }
 
+static void disarm_static_keys(struct mem_cgroup *memcg)
+{
+#ifdef CONFIG_INET
+	if (memcg->tcp_mem.limited)	
+		static_key_slow_dec(&memcg_socket_limit_enabled);
+#endif
+}
+
+
 #ifdef CONFIG_INET
 struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
 {
@@ -452,6 +461,11 @@  struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
 }
 EXPORT_SYMBOL(tcp_proto_cgroup);
 #endif /* CONFIG_INET */
+#else
+static inline void disarm_static_keys(struct mem_cgroup *memcg)
+{
+}
+
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 
 static void drain_all_stock_async(struct mem_cgroup *memcg);
@@ -4883,6 +4897,7 @@  static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
 {
 	if (atomic_sub_and_test(count, &memcg->refcnt)) {
 		struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+		disarm_static_keys(memcg);
 		__mem_cgroup_free(memcg);
 		if (parent)
 			mem_cgroup_put(parent);
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 1517037..93555ab 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -41,6 +41,7 @@  int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 	tcp->tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
 	tcp->tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
 	tcp->tcp_memory_pressure = 0;
+	tcp->limited = false;
 
 	parent_cg = tcp_prot.proto_cgroup(parent);
 	if (parent_cg)
@@ -74,9 +75,6 @@  void tcp_destroy_cgroup(struct mem_cgroup *memcg)
 	percpu_counter_destroy(&tcp->tcp_sockets_allocated);
 
 	val = res_counter_read_u64(&tcp->tcp_memory_allocated, RES_LIMIT);
-
-	if (val != RESOURCE_MAX)
-		static_key_slow_dec(&memcg_socket_limit_enabled);
 }
 EXPORT_SYMBOL(tcp_destroy_cgroup);
 
@@ -107,10 +105,10 @@  static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
 		tcp->tcp_prot_mem[i] = min_t(long, val >> PAGE_SHIFT,
 					     net->ipv4.sysctl_tcp_mem[i]);
 
-	if (val == RESOURCE_MAX && old_lim != RESOURCE_MAX)
-		static_key_slow_dec(&memcg_socket_limit_enabled);
-	else if (old_lim == RESOURCE_MAX && val != RESOURCE_MAX)
+	if (old_lim == RESOURCE_MAX && !tcp->limited) {
 		static_key_slow_inc(&memcg_socket_limit_enabled);
+		tcp->limited = true;
+	}
 
 	return 0;
 }
-- 
1.7.7.6