diff mbox

[5/8] mm: memcontrol: account socket memory on unified hierarchy

Message ID 1445487696-21545-6-git-send-email-hannes@cmpxchg.org
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Johannes Weiner Oct. 22, 2015, 4:21 a.m. UTC
Socket memory can be a significant share of overall memory consumed by
common workloads. In order to provide reasonable resource isolation
out-of-the-box in the unified hierarchy, this type of memory needs to
be accounted and tracked by default in the memory controller.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 16 ++++++--
 mm/memcontrol.c            | 95 ++++++++++++++++++++++++++++++++++++----------
 2 files changed, 87 insertions(+), 24 deletions(-)

Comments

Vladimir Davydov Oct. 22, 2015, 6:47 p.m. UTC | #1
On Thu, Oct 22, 2015 at 12:21:33AM -0400, Johannes Weiner wrote:
...
> @@ -5500,13 +5524,38 @@ void sock_release_memcg(struct sock *sk)
>   */
>  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> +	unsigned int batch = max(CHARGE_BATCH, nr_pages);
>  	struct page_counter *counter;
> +	bool force = false;
>  
> -	if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> +		if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> +			return true;
> +		page_counter_charge(&memcg->skmem, nr_pages);
> +		return false;
> +	}
> +
> +	if (consume_stock(memcg, nr_pages))
>  		return true;
> +retry:
> +	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> +		goto done;

Currently, we use memcg->memory only for charging memory pages. Besides,
every page charged to this counter (including kmem) has ->mem_cgroup
field set appropriately. This looks consistent and nice. As an extra
benefit, we can track all pages charged to a memory cgroup via
/proc/kpagecgroup.

Now, you charge "window size" to it, which AFAIU isn't necessarily equal
to the amount of memory actually consumed by the cgroup for socket
buffers. I think this looks ugly and inconsistent with the existing
behavior. I agree that we need to charge socket buffers to ->memory, but
IMO we should do that for each skb page, using memcg_kmem_charge_kmem
somewhere in alloc_skb_with_frags, invoking the reclaimer just as we do
for kmalloc, while tcp window size control should stay aside.

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Hocko Oct. 23, 2015, 1:19 p.m. UTC | #2
On Thu 22-10-15 00:21:33, Johannes Weiner wrote:
> Socket memory can be a significant share of overall memory consumed by
> common workloads. In order to provide reasonable resource isolation
> out-of-the-box in the unified hierarchy, this type of memory needs to
> be accounted and tracked per default in the memory controller.

What about users who do not want to pay an additional overhead for the
accounting? How can they disable it?

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

[...]

> @@ -5453,10 +5470,9 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage)
>  	commit_charge(newpage, memcg, true);
>  }
>  
> -/* Writing them here to avoid exposing memcg's inner layout */
> -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> +#ifdef CONFIG_INET
>  
> -DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);
> +DEFINE_STATIC_KEY_TRUE(mem_cgroup_sockets);

AFAIU this means that the jump label is enabled by default. Is this
intended when you enable it explicitly where needed?

>  
>  void sock_update_memcg(struct sock *sk)
>  {
David Miller Oct. 23, 2015, 1:59 p.m. UTC | #3
From: Michal Hocko <mhocko@kernel.org>
Date: Fri, 23 Oct 2015 15:19:56 +0200

> On Thu 22-10-15 00:21:33, Johannes Weiner wrote:
>> Socket memory can be a significant share of overall memory consumed by
>> common workloads. In order to provide reasonable resource isolation
>> out-of-the-box in the unified hierarchy, this type of memory needs to
>> be accounted and tracked per default in the memory controller.
> 
> What about users who do not want to pay an additional overhead for the
> accounting? How can they disable it?

Yeah, this really cannot pass.

This extra overhead will be seen by 99.9999% of users, since entities
(especially distributions) just flip on all of these config options by
default.
Johannes Weiner Oct. 26, 2015, 4:56 p.m. UTC | #4
On Fri, Oct 23, 2015 at 06:59:57AM -0700, David Miller wrote:
> From: Michal Hocko <mhocko@kernel.org>
> Date: Fri, 23 Oct 2015 15:19:56 +0200
> 
> > On Thu 22-10-15 00:21:33, Johannes Weiner wrote:
> >> Socket memory can be a significant share of overall memory consumed by
> >> common workloads. In order to provide reasonable resource isolation
> >> out-of-the-box in the unified hierarchy, this type of memory needs to
> >> be accounted and tracked per default in the memory controller.
> > 
> > What about users who do not want to pay an additional overhead for the
> > accounting? How can they disable it?
> 
> Yeah, this really cannot pass.
> 
> This extra overhead will be seen by %99.9999 of users, since entities
> (especially distributions) just flip on all of these config options by
> default.

Okay, there are several layers to this issue.

If you boot a machine with a CONFIG_MEMCG distribution kernel and
don't create any cgroups, I agree there shouldn't be any overhead.

I already sent a patch to generally remove memory accounting on the
system or root level. I can easily update this patch here to not have
any socket buffer accounting overhead for systems that don't actively
use cgroups. Would you be okay with a branch on sk->sk_memcg in the
network accounting path? I'd leave that NULL on the system level then.

Then there is of course the case when you create cgroups for process
organization but don't care about memory accounting. Systemd comes to
mind. Or even if you create cgroups to track other resources like CPU
but don't care about memory. The unified hierarchy no longer enables
controllers on new cgroups by default, so unless you create a cgroup
and specifically tell it to account and track memory, you won't have
the socket memory accounting overhead, either.

Then there is the third case, where you create a control group to
specifically manage and limit the memory consumption of a workload. In
that scenario, a major memory consumer like socket buffers, which can
easily grow until OOM, should definitely be included in the tracking
in order to properly contain both untrusted (possibly malicious) and
trusted (possibly buggy) workloads. This is not a hole we can
reasonably leave unpatched for general-purpose resource management.

Now you could argue that there might exist specialized workloads that
need to account anonymous pages and page cache, but not socket memory
buffers. Or any other combination of pick-and-choose consumers. But
honestly, nowadays all our paths are lockless, and the counting is an
atomic-add-return with a per-cpu batch cache. I don't think there is a
compelling case for an elaborate interface to make individual memory
consumers configurable inside the memory controller.

So in summary, would you be okay with this patch if networking only
called into the memory controller when you explicitly create a cgroup
AND tell it to track the memory footprint of the workload in it?
Michal Hocko Oct. 27, 2015, 12:26 p.m. UTC | #5
On Mon 26-10-15 12:56:19, Johannes Weiner wrote:
[...]
> Now you could argue that there might exist specialized workloads that
> need to account anonymous pages and page cache, but not socket memory
> buffers.

Exactly, and there are loads doing this. Memcg groups are also created to
limit anon/page cache consumers to not affect the others running on
the system (basically in the root memcg context from memcg POV) which
don't care about tracking and they definitely do not want to pay for an
additional overhead. We should definitely be able to offer a global
disable knob for them. The same applies to kmem accounting in general.

I am fine with having the accounting enabled by default once we are
reasonably sure that both kmem/tcp are stable enough (which, to be
honest, I am not convinced about yet), but there will always be special
loads which simply do not care about kmem/tcp accounting and would
rather pay a global balancing price (even OOM) than a permanent one. And
they should get a way to opt out.

> Or any other combination of pick-and-choose consumers. But
> honestly, nowadays all our paths are lockless, and the counting is an
> atomic-add-return with a per-cpu batch cache.

You are still hooking into hot paths and there are users who want to
squeeze every single cycle from the HW.

> I don't think there is a compelling case for an elaborate interface
> to make individual memory consumers configurable inside the memory
> controller.

I do not think we need an elaborate interface. We just want to have
a global boot time knob to overwrite the default behavior. This is
few lines of code and it should give the sufficient flexibility.
David Miller Oct. 27, 2015, 1:49 p.m. UTC | #6
From: Michal Hocko <mhocko@kernel.org>
Date: Tue, 27 Oct 2015 13:26:47 +0100

> On Mon 26-10-15 12:56:19, Johannes Weiner wrote:
> [...]
>> Or any other combination of pick-and-choose consumers. But
>> honestly, nowadays all our paths are lockless, and the counting is an
>> atomic-add-return with a per-cpu batch cache.
> 
> You are still hooking into hot paths and there are users who want to
> squeeze every single cycle from the HW.

Yeah, you're probably undoing half a year of work by another developer
who managed to remove an atomic operation from these paths.
Johannes Weiner Oct. 27, 2015, 3:41 p.m. UTC | #7
On Tue, Oct 27, 2015 at 01:26:47PM +0100, Michal Hocko wrote:
> On Mon 26-10-15 12:56:19, Johannes Weiner wrote:
> [...]
> > Now you could argue that there might exist specialized workloads that
> > need to account anonymous pages and page cache, but not socket memory
> > buffers.
> 
> Exactly, and there are loads doing this. Memcg groups are also created to
> limit anon/page cache consumers to not affect the others running on
> the system (basically in the root memcg context from memcg POV) which
> don't care about tracking and they definitely do not want to pay for an
> additional overhead. We should definitely be able to offer a global
> disable knob for them. The same applies to kmem accounting in general.

I don't see how you make such a clear distinction between, say, page
cache and the dentry cache, and call one user memory and the other
kernel memory. That just doesn't make sense to me. They're both kernel
memory allocated on behalf of the user, the only difference being that
one is tracked on the page level and the other on the slab level, and
we started accounting one before the other.

IMO that's an implementation detail and a historical artifact that
should not be exposed to the user. And that's the thing I hate about
the current opt-out knob.

> > I don't think there is a compelling case for an elaborate interface
> > to make individual memory consumers configurable inside the memory
> > controller.
> 
> I do not think we need an elaborate interface. We just want to have
> a global boot time knob to overwrite the default behavior. This is
> few lines of code and it should give the sufficient flexibility.

Okay, then let's add this for the socket memory to start with. I'll
have to think more about how to distinguish the slab-based consumers.
Or maybe you have an idea.

For now, something like this as a boot commandline?

cgroup.memory=nosocket

So again in summary, no default overhead until you create a cgroup to
specifically track and account memory. And then, when you know what
you are doing and have a specialized workload, you can disable socket
memory as a specific consumer to remove that particular overhead while
still being able to contain page cache, anon, kmem, whatever.

Does that sound like reasonable user interfacing to everyone?
Michal Hocko Oct. 27, 2015, 4:15 p.m. UTC | #8
On Tue 27-10-15 11:41:38, Johannes Weiner wrote:
> On Tue, Oct 27, 2015 at 01:26:47PM +0100, Michal Hocko wrote:
> > On Mon 26-10-15 12:56:19, Johannes Weiner wrote:
> > [...]
> > > Now you could argue that there might exist specialized workloads that
> > > need to account anonymous pages and page cache, but not socket memory
> > > buffers.
> > 
> > Exactly, and there are loads doing this. Memcg groups are also created to
> > limit anon/page cache consumers to not affect the others running on
> > the system (basically in the root memcg context from memcg POV) which
> > don't care about tracking and they definitely do not want to pay for an
> > additional overhead. We should definitely be able to offer a global
> > disable knob for them. The same applies to kmem accounting in general.
> 
> I don't see how you make such a clear distinction between, say, page
> cache and the dentry cache, and call one user memory and the other
> kernel memory.

Because the kernel memory footprint would be so small that it simply
doesn't change the picture at all, while the page cache or anonymous
memory consumption might be so large as to be disruptive. I am talking
about loads where good enough is better than "perfect", and where
ephemeral global memory pressure when kmem goes over expectations is
better than a permanent CPU overhead. Whatever we do, that overhead will
always be non-zero.

Also, kmem accounting will make the load more non-deterministic, because
many of the resources are shared between tasks in separate cgroups
unless they are explicitly configured. E.g. the [id]cache (inode and
dentry caches) will be shared, and the first to touch gets charged, so
you would end up with more false sharing.

Nevertheless, I do not want to shift the discussion from the topic. I
just think that one-fits-all simply won't work.

> That just doesn't make sense to me. They're both kernel
> memory allocated on behalf of the user, the only difference being that
> one is tracked on the page level and the other on the slab level, and
> we started accounting one before the other.
> 
> IMO that's an implementation detail and a historical artifact that
> should not be exposed to the user. And that's the thing I hate about
> the current opt-out knob.
> 
> > > I don't think there is a compelling case for an elaborate interface
> > > to make individual memory consumers configurable inside the memory
> > > controller.
> > 
> > I do not think we need an elaborate interface. We just want to have
> > a global boot time knob to overwrite the default behavior. This is
> > few lines of code and it should give the sufficient flexibility.
> 
> Okay, then let's add this for the socket memory to start with. I'll
> have to think more about how to distinguish the slab-based consumers.
> Or maybe you have an idea.

Isn't that as simple as enabling the jump label during the
initialization depending on the knob value? All the charging paths
should be disabled by default already.

> For now, something like this as a boot commandline?
> 
> cgroup.memory=nosocket

That would work for me. I could even see a CONFIG_MEMCG_TCP_KMEM_ENABLED
config option for the default, with [no]socket as a kernel parameter to
override the configured default. This would allow distributions to
define their policy without hard-enforcing it, and those who compile
their own kernel to define their own policy.
Johannes Weiner Oct. 27, 2015, 4:42 p.m. UTC | #9
On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote:
> On Tue 27-10-15 11:41:38, Johannes Weiner wrote:
> > On Tue, Oct 27, 2015 at 01:26:47PM +0100, Michal Hocko wrote:
> > > On Mon 26-10-15 12:56:19, Johannes Weiner wrote:
> > > [...]
> > > > Now you could argue that there might exist specialized workloads that
> > > > need to account anonymous pages and page cache, but not socket memory
> > > > buffers.
> > > 
> > > Exactly, and there are loads doing this. Memcg groups are also created to
> > > limit anon/page cache consumers to not affect the others running on
> > > the system (basically in the root memcg context from memcg POV) which
> > > don't care about tracking and they definitely do not want to pay for an
> > > additional overhead. We should definitely be able to offer a global
> > > disable knob for them. The same applies to kmem accounting in general.
> > 
> > I don't see how you make such a clear distinction between, say, page
> > cache and the dentry cache, and call one user memory and the other
> > kernel memory.
> 
> Because the kernel memory footprint would be so small that it simply
> doesn't change the picture at all. While the page cache or anonymous
> memory consumption might be so large it might be disruptive.

Or it could be exactly the other way around when you have a workload
that is heavy on filesystem metadata. I don't see why any scenario
would be more important than the other.

I'm not saying that distinguishing between consumers is wrong, just
that "user memory vs kernel memory" is a false classification. Why do
you call page cache user memory but dentry cache kernel memory? It
doesn't make any sense.

> Also kmem accounting will make the load more non-deterministic because
> many of the resources are shared between tasks in separate cgroups
> unless they are explicitly configured. E.g. [id]cache will be shared
> and first to touch gets charged so you would end up with more false
> sharing.

Exactly like page cache. This differentiation isn't based on reality.

> Nevertheless, I do not want to shift the discussion from the topic. I
> just think that one-fits-all simply won't work.

Okay, this is something we can converge on.

> > That just doesn't make sense to me. They're both kernel
> > memory allocated on behalf of the user, the only difference being that
> > one is tracked on the page level and the other on the slab level, and
> > we started accounting one before the other.
> > 
> > IMO that's an implementation detail and a historical artifact that
> > should not be exposed to the user. And that's the thing I hate about
> > the current opt-out knob.

You carefully skipped over this part. We can ignore it for socket
memory but it's something we need to figure out when it comes to slab
accounting and tracking.

> > > > I don't think there is a compelling case for an elaborate interface
> > > > to make individual memory consumers configurable inside the memory
> > > > controller.
> > > 
> > > I do not think we need an elaborate interface. We just want to have
> > > a global boot time knob to overwrite the default behavior. This is
> > > few lines of code and it should give the sufficient flexibility.
> > 
> > Okay, then let's add this for the socket memory to start with. I'll
> > have to think more about how to distinguish the slab-based consumers.
> > Or maybe you have an idea.
> 
> Isn't that as simple as enabling the jump label during the
> initialization depending on the knob value? All the charging paths
> should be disabled by default already.

You missed my point. It's not about the implementation, it's about how
we present these choices to the user.

Having page cache accounting built in while presenting dentry+inode
cache as a configurable extension is completely random and doesn't
make sense. They are both first class memory consumers. They're not
separate categories. One isn't more "core" than the other.

> > For now, something like this as a boot commandline?
> > 
> > cgroup.memory=nosocket
> 
> That would work for me.

Okay, then I'll go that route for the socket stuff.

Dave is that cool with you?
David Miller Oct. 28, 2015, 12:45 a.m. UTC | #10
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 27 Oct 2015 09:42:27 -0700

> On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote:
>> > For now, something like this as a boot commandline?
>> > 
>> > cgroup.memory=nosocket
>> 
>> That would work for me.
> 
> Okay, then I'll go that route for the socket stuff.
> 
> Dave is that cool with you?

Depends upon the default.

Until the user configures something explicitly into the memory
controller, the networking bits should all evaluate to nothing.
Johannes Weiner Oct. 28, 2015, 3:05 a.m. UTC | #11
On Tue, Oct 27, 2015 at 05:45:32PM -0700, David Miller wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Tue, 27 Oct 2015 09:42:27 -0700
> 
> > On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote:
> >> > For now, something like this as a boot commandline?
> >> > 
> >> > cgroup.memory=nosocket
> >> 
> >> That would work for me.
> > 
> > Okay, then I'll go that route for the socket stuff.
> > 
> > Dave is that cool with you?
> 
> Depends upon the default.
> 
> Until the user configures something explicitly into the memory
> controller, the networking bits should all evaluate to nothing.

Yep, I'll stick them behind a default-off jump label again.

This bootflag is only to override an active memory controller
configuration and force-off that jump label permanently.
Michal Hocko Oct. 29, 2015, 3:25 p.m. UTC | #12
On Tue 27-10-15 09:42:27, Johannes Weiner wrote:
> On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote:
> > On Tue 27-10-15 11:41:38, Johannes Weiner wrote:
[...]
> Or it could be exactly the other way around when you have a workload
> that is heavy on filesystem metadata. I don't see why any scenario
> would be more important than the other.

Yes, I definitely agree. No scenario is more important. We can only
come up with a default that makes sense for the majority and allow the
minority to override it. That is basically what I wanted to say.

> I'm not saying that distinguishing between consumers is wrong, just
> that "user memory vs kernel memory" is a false classification. Why do
> you call page cache user memory but dentry cache kernel memory? It
> doesn't make any sense.

We are not talking about dcache vs. page cache alone here, though. We
are talking about _all_ slab allocations vs. only user accessed memory.
The slab consumption is directly under kernel control, and a great pile
of that logic is completely hidden from userspace. While a user can
estimate the user memory footprint, it is hard (if possible at all) to
do that for the kernel memory footprint - not to mention that it is
variable and dependent on the particular kernel version.

> > Also kmem accounting will make the load more non-deterministic because
> > many of the resources are shared between tasks in separate cgroups
> > unless they are explicitly configured. E.g. [id]cache will be shared
> > and first to touch gets charged so you would end up with more false
> > sharing.
> 
> Exactly like page cache. This differentiation isn't based on reality.

Yes, false sharing is an existing, long-standing problem already. I just
wanted to point out that it would be an even bigger problem here,
because some kernel-tracked resources are shared more naturally than
files are.

> > > IMO that's an implementation detail and a historical artifact that
> > > should not be exposed to the user. And that's the thing I hate about
> > > the current opt-out knob.
> 
> You carefully skipped over this part. We can ignore it for socket
> memory but it's something we need to figure out when it comes to slab
> accounting and tracking.

I am sorry, I didn't mean to skip this part; I thought it would be clear
from the previous text. I think kmem accounting falls into the same
category: have a sane default and a global boot-time knob to override it
for those who think differently - for whatever reason they might have.

[...]
 
> Having page cache accounting built in while presenting dentry+inode
> cache as a configurable extension is completely random and doesn't
> make sense. They are both first class memory consumers. They're not
> separate categories. One isn't more "core" than the other.

Again, we are talking about all slab allocations, not just the dcache.

> > > For now, something like this as a boot commandline?
> > > 
> > > cgroup.memory=nosocket
> > 
> > That would work for me.
> 
> Okay, then I'll go that route for the socket stuff.

Thanks!
Johannes Weiner Oct. 29, 2015, 4:10 p.m. UTC | #13
On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote:
> On Tue 27-10-15 09:42:27, Johannes Weiner wrote:
> > On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote:
> > > On Tue 27-10-15 11:41:38, Johannes Weiner wrote:
> > > > IMO that's an implementation detail and a historical artifact that
> > > > should not be exposed to the user. And that's the thing I hate about
> > > > the current opt-out knob.
> > 
> > You carefully skipped over this part. We can ignore it for socket
> > memory but it's something we need to figure out when it comes to slab
> > accounting and tracking.
> 
> I am sorry, I didn't mean to skip this part, I though it would be clear
> from the previous text. I think kmem accounting falls into the same
> category. Have a sane default and a global boottime knob to override it
> for those that think differently - for whatever reason they might have.

Yes, that makes sense to me.

Like cgroup.memory=nosocket, would you think it makes sense to include
slab in the default for functional/semantical completeness and provide
a cgroup.memory=noslab for powerusers?
Michal Hocko Nov. 4, 2015, 10:42 a.m. UTC | #14
On Thu 29-10-15 09:10:09, Johannes Weiner wrote:
> On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote:
> > On Tue 27-10-15 09:42:27, Johannes Weiner wrote:
[...]
> > > You carefully skipped over this part. We can ignore it for socket
> > > memory but it's something we need to figure out when it comes to slab
> > > accounting and tracking.
> > 
> > I am sorry, I didn't mean to skip this part, I though it would be clear
> > from the previous text. I think kmem accounting falls into the same
> > category. Have a sane default and a global boottime knob to override it
> > for those that think differently - for whatever reason they might have.
> 
> Yes, that makes sense to me.
> 
> Like cgroup.memory=nosocket, would you think it makes sense to include
> slab in the default for functional/semantical completeness and provide
> a cgroup.memory=noslab for powerusers?

I am still not sure whether the kmem accounting is stable enough to be
enabled by default. If nothing else, the allocation failures, which are
not allowed in the global case but easily triggered by the hard limit,
might be a big problem. My last attempts to allow GFP_NOFS to fail made
me quite skeptical. I still believe this is something that will be
solved in the long term, but the current state might still be too
fragile. So I would rather be conservative and have the kmem accounting
If somebody is confident that the desired load is stable then the config
can be enabled easily.
Johannes Weiner Nov. 4, 2015, 7:50 p.m. UTC | #15
On Wed, Nov 04, 2015 at 11:42:40AM +0100, Michal Hocko wrote:
> On Thu 29-10-15 09:10:09, Johannes Weiner wrote:
> > On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote:
> > > On Tue 27-10-15 09:42:27, Johannes Weiner wrote:
> [...]
> > > > You carefully skipped over this part. We can ignore it for socket
> > > > memory but it's something we need to figure out when it comes to slab
> > > > accounting and tracking.
> > > 
> > > I am sorry, I didn't mean to skip this part, I though it would be clear
> > > from the previous text. I think kmem accounting falls into the same
> > > category. Have a sane default and a global boottime knob to override it
> > > for those that think differently - for whatever reason they might have.
> > 
> > Yes, that makes sense to me.
> > 
> > Like cgroup.memory=nosocket, would you think it makes sense to include
> > slab in the default for functional/semantical completeness and provide
> > a cgroup.memory=noslab for powerusers?
> 
> I am still not sure whether the kmem accounting is stable enough to be
> enabled by default. If for nothing else the allocation failures, which
> are not allowed for the global case and easily triggered by the hard
> limit, might be a big problem. My last attempts to allow GFP_NOFS to
> fail made me quite skeptical. I still believe this is something which
> will be solved in the long term but the current state might be still too
> fragile. So I would rather be conservative and have the kmem accounting
> disabled by default with a config option and boot parameter to override.
> If somebody is confident that the desired load is stable then the config
> can be enabled easily.

I agree with your assessment of the current kmem code state, but I
think your conclusion is completely backwards here.

The interface will be set in stone forever, whereas any stability
issues will be transient and will have to be addressed in a finite
amount of time anyway. It doesn't make sense to design an interface
based on temporary quality of implementation. Only one of those two
can ever be changed.

Because it goes without saying that once the cgroupv2 interface is
released, and people use it in production, there is no way we can then
*add* dentry cache, inode cache, and others to memory.current. That
would be an unacceptable change in interface behavior. On the other
hand, people will be prepared for hiccups in the early stages of
cgroupv2 release, and we're providing cgroup.memory=noslab to let them
work around severe problems in production until we fix them, without
forcing them to fully revert to cgroupv1.

So if we agree that there are no fundamental architectural concerns
with slab accounting, i.e. nothing that can't be addressed in the
implementation, we have to make the call now.

And I maintain that not accounting dentry cache and inode cache is a
gaping hole in memory isolation, so it should be included by default.
(The rest of the slabs is arguable, but IMO the risk of missing
something important is higher than the cost of including them.)


As far as your allocation failure concerns go, I think the kmem code
is currently not behaving as Glauber originally intended, which is to
force charge if reclaim and OOM killing weren't able to make enough
space. See this recently rewritten section of the kmem charge path:

-               /*
-                * try_charge() chose to bypass to root due to OOM kill or
-                * fatal signal.  Since our only options are to either fail
-                * the allocation or charge it to this cgroup, do it as a
-                * temporary condition. But we can't fail. From a kmem/slab
-                * perspective, the cache has already been selected, by
-                * mem_cgroup_kmem_get_cache(), so it is too late to change
-                * our minds.
-                *
-                * This condition will only trigger if the task entered
-                * memcg_charge_kmem in a sane state, but was OOM-killed
-                * during try_charge() above. Tasks that were already dying
-                * when the allocation triggers should have been already
-                * directed to the root cgroup in memcontrol.h
-                */
-               page_counter_charge(&memcg->memory, nr_pages);
-               if (do_swap_account)
-                       page_counter_charge(&memcg->memsw, nr_pages);

It could be that this never properly worked as it was tied to the
-EINTR bypass trick, but the idea was these charges never fail.

And this makes sense. If the allocator semantics are such that we
never fail these page allocations for slab, and the callsites rely on
that, surely we should not fail them in the memory controller, either.

And it makes a lot more sense to account them in excess of the limit
than pretend they don't exist. We might not be able to completely
fulfill the containment part of the memory controller (although these
slab charges will still create significant pressure before that), but
at least we don't fail the accounting part on top of it.
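The force-charge behavior argued for here can be sketched with a tiny userspace model of page_counter. This is a simplification for illustration only: the real counter in mm/page_counter.c uses atomics and walks a hierarchy of parent counters, and the function names below merely mirror the kernel's try/force contract.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal userspace model of the kernel's page_counter. */
struct page_counter {
    long usage;
    long limit;
};

/* Fail the charge if it would exceed the limit (mirrors the
 * success/failure contract of page_counter_try_charge()). */
static bool try_charge(struct page_counter *c, long nr_pages)
{
    if (c->usage + nr_pages > c->limit)
        return false;
    c->usage += nr_pages;
    return true;
}

/* Unconditional charge: always succeeds and may push usage past
 * the limit (mirrors page_counter_charge()). */
static void force_charge(struct page_counter *c, long nr_pages)
{
    c->usage += nr_pages;
}

/* The pattern under discussion: attempt a limited charge, and when
 * containment cannot be enforced, account the excess rather than
 * fail the kernel allocation. */
static bool charge_kmem(struct page_counter *c, long nr_pages)
{
    if (try_charge(c, nr_pages))
        return true;        /* charged within the limit */
    force_charge(c, nr_pages);
    return false;           /* charged, but the limit was breached */
}
```

The key property is that charge_kmem() never leaves the allocation unaccounted: a caller that cannot tolerate failure still gets its memory tracked, just in excess of the limit.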
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Hocko Nov. 5, 2015, 2:40 p.m. UTC | #16
On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
[...]
> Because it goes without saying that once the cgroupv2 interface is
> released, and people use it in production, there is no way we can then
> *add* dentry cache, inode cache, and others to memory.current. That
> would be an unacceptable change in interface behavior.

They would still have to _enable_ the config option _explicitly_. make
oldconfig wouldn't change it silently for them. I do not think
it is an unacceptable change of behavior if the config is changed
explicitly.

> On the other
> hand, people will be prepared for hiccups in the early stages of
> cgroupv2 release, and we're providing cgroup.memory=noslab to let them
> work around severe problems in production until we fix it without
> forcing them to fully revert to cgroupv1.

This would be true if they moved on to the new cgroup API intentionally.
The reality is more complicated though. AFAIK sysmted is waiting for
cgroup2 already and privileged services enable all available resource
controllers by default as I've learned just recently. If we know that
the interface is not stable enough then we are basically forcing _most_
users to use the kernel boot parameter if we stay with the current
kmem semantic. More on that below.

> So if we agree that there are no fundamental architectural concerns
> with slab accounting, i.e. nothing that can't be addressed in the
> implementation, we have to make the call now.

We are on the same page here.

> And I maintain that not accounting dentry cache and inode cache is a
> gaping hole in memory isolation, so it should be included by default.
> (The rest of the slabs is arguable, but IMO the risk of missing
> something important is higher than the cost of including them.)

More on that below.
 
> As far as your allocation failure concerns go, I think the kmem code
> is currently not behaving as Glauber originally intended, which is to
> force charge if reclaim and OOM killing weren't able to make enough
> space. See this recently rewritten section of the kmem charge path:
> 
> -               /*
> -                * try_charge() chose to bypass to root due to OOM kill or
> -                * fatal signal.  Since our only options are to either fail
> -                * the allocation or charge it to this cgroup, do it as a
> -                * temporary condition. But we can't fail. From a kmem/slab
> -                * perspective, the cache has already been selected, by
> -                * mem_cgroup_kmem_get_cache(), so it is too late to change
> -                * our minds.
> -                *
> -                * This condition will only trigger if the task entered
> -                * memcg_charge_kmem in a sane state, but was OOM-killed
> -                * during try_charge() above. Tasks that were already dying
> -                * when the allocation triggers should have been already
> -                * directed to the root cgroup in memcontrol.h
> -                */
> -               page_counter_charge(&memcg->memory, nr_pages);
> -               if (do_swap_account)
> -                       page_counter_charge(&memcg->memsw, nr_pages);
> 
> It could be that this never properly worked as it was tied to the
> -EINTR bypass trick, but the idea was these charges never fail.

I have always understood this path as a corner case when the task is an
oom victim or exiting. So this would be only a temporary condition which
cannot cause a complete runaway.

> And this makes sense. If the allocator semantics are such that we
> never fail these page allocations for slab, and the callsites rely on
> that, surely we should not fail them in the memory controller, either.

Then we can only bypass them or loop inside the charge code forever
like we do in the page allocator. The latter is really fragile, and
even more so in a restricted environment, as we have learned with the
memcg OOM killer in the past.
 
> And it makes a lot more sense to account them in excess of the limit
> than pretend they don't exist. We might not be able to completely
> fulfill the containment part of the memory controller (although these
> slab charges will still create significant pressure before that), but
> at least we don't fail the accounting part on top of it.

Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any
load could simply run away via kernel allocations. What is even worse we
might even not trigger memcg OOM killer before we hit the global OOM. So
the whole containment goes straight to hell.

I can see four options here:
1) enable kmem by default with the current semantic which we know can
   BUG_ON (at least btrfs is known to hit this) or lead to other issues.
2) enable kmem by default and change the semantic for cgroup2 to allow
   runaway charges above the hard limit which would defeat the whole
   purpose of the containment for cgroup2. This can be a temporary
   workaround until we can afford kmem failures. This has a big risk
   that we will end up with this permanently because there is a strong
   pressure that GFP_KERNEL allocations should never fail. Yet this is
   the most common type of request. Or do we change the consistency with
   the global case at some point?
3) keep only some (safe) cache types enabled by default with the current
   failing semantic and require an explicit enabling for the complete
   kmem accounting. [di]cache code paths should be quite robust to
   handle allocation failures.
4) disable kmem by default and change the config default later to signal
   the accounting is safe as far as we are aware and let people enable
   the functionality on that basis. We would keep the current failing
   semantic.

To me 4) sounds like the safest option because it still keeps the
functionality available to those who can benefit from it in v1 already
while we are not exposing a potentially buggy behavior to the majority
(many of them even unintentionally).  Moreover we still allow changing
the default later on an explicit basis. 3) sounds like the second best
option but I am not really sure whether we can do that very easily
without bringing up a lot of unmaintainable mess. 2) sounds like the
third best approach but I am afraid it would render the basic use cases
unusable for a very long time and kill any interest in cgroup2 for even
longer (cargo cults are really hard to get rid of). 1) sounds like a
land mine approach which would force many/most users to simply keep
using the boot option and force us to re-evaluate the default the hard way.
David Miller Nov. 5, 2015, 4:16 p.m. UTC | #17
From: Michal Hocko <mhocko@kernel.org>
Date: Thu, 5 Nov 2015 15:40:02 +0100

> On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> [...]
>> Because it goes without saying that once the cgroupv2 interface is
>> released, and people use it in production, there is no way we can then
>> *add* dentry cache, inode cache, and others to memory.current. That
>> would be an unacceptable change in interface behavior.
> 
> They would still have to _enable_ the config option _explicitly_. make
> oldconfig wouldn't change it silently for them. I do not think
> it is an unacceptable change of behavior if the config is changed
> explicitly.

Every user is going to get this config option when they update their
distribution kernel or whatever.

Then they will all wonder why their networking performance went down.

This is why I do not want the networking accounting bits on by default
even if the kconfig option is enabled.  They must be off by default
and guarded by a static branch so the cost is exactly zero.
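Dave's zero-cost-when-off requirement can be illustrated with a userspace stand-in. Note that the kernel's static_branch_unlikely() is not a plain conditional: the disabled key is patched into a no-op in the instruction stream, so the off path costs no branch at all. The bool below is only a stand-in for that static key, used to show where the guard sits.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace approximation of the kernel's static-branch pattern.
 * In the kernel this would be a struct static_key tested with
 * static_branch_unlikely(); here a plain bool stands in for it. */
static bool memcg_sockets_enabled;

static long skmem_charged;   /* pages accounted to the cgroup */

/* Account socket memory only when the feature has been enabled,
 * mirroring how the charge path is reached only behind the key. */
static void sk_memory_charge(long nr_pages)
{
    if (memcg_sockets_enabled)   /* static_branch_unlikely(&key) */
        skmem_charged += nr_pages;
}
```

With the real static key, a freshly booted system that never enables the controller executes a no-op here rather than a load and compare, which is what makes the "exactly zero" cost claim possible.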
Michal Hocko Nov. 5, 2015, 4:28 p.m. UTC | #18
On Thu 05-11-15 11:16:09, David S. Miller wrote:
> From: Michal Hocko <mhocko@kernel.org>
> Date: Thu, 5 Nov 2015 15:40:02 +0100
> 
> > On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> > [...]
> >> Because it goes without saying that once the cgroupv2 interface is
> >> released, and people use it in production, there is no way we can then
> >> *add* dentry cache, inode cache, and others to memory.current. That
> >> would be an unacceptable change in interface behavior.
> > 
> > They would still have to _enable_ the config option _explicitly_. make
> > oldconfig wouldn't change it silently for them. I do not think
> > it is an unacceptable change of behavior if the config is changed
> > explicitly.
> 
> Every user is going to get this config option when they update their
> distribution kernel or whatever.
> 
> Then they will all wonder why their networking performance went down.
> 
> This is why I do not want the networking accounting bits on by default
> even if the kconfig option is enabled.  They must be off by default
> and guarded by a static branch so the cost is exactly zero.

Yes, that part is clear and Johannes made it clear that the kmem tcp
part is disabled by default. Or are you considering also all the slab
usage by the networking code as well?
David Miller Nov. 5, 2015, 4:30 p.m. UTC | #19
From: Michal Hocko <mhocko@kernel.org>
Date: Thu, 5 Nov 2015 17:28:03 +0100

> Yes, that part is clear and Johannes made it clear that the kmem tcp
> part is disabled by default. Or are you considering also all the slab
> usage by the networking code as well?

I'm still thinking about the implications of that aspect, and will
comment when I have something coherent to say about it.
Johannes Weiner Nov. 5, 2015, 8:55 p.m. UTC | #20
On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> [...]
> > Because it goes without saying that once the cgroupv2 interface is
> > released, and people use it in production, there is no way we can then
> > *add* dentry cache, inode cache, and others to memory.current. That
> > would be an unacceptable change in interface behavior.
> 
> They would still have to _enable_ the config option _explicitly_. make
> oldconfig wouldn't change it silently for them. I do not think
> it is an unacceptable change of behavior if the config is changed
> explicitly.

Yeah, as Dave said these will all get turned on anyway, so there is no
point in fragmenting the Kconfig space in the first place.

> > On the other
> > hand, people will be prepared for hiccups in the early stages of
> > cgroupv2 release, and we're providing cgroup.memory=noslab to let them
> > work around severe problems in production until we fix it without
> > forcing them to fully revert to cgroupv1.
> 
> This would be true if they moved on to the new cgroup API intentionally.
> The reality is more complicated though. AFAIK systemd is waiting for
> cgroup2 already and privileged services enable all available resource
> controllers by default as I've learned just recently.

Have you filed a report with them? I don't think they should turn them
on unless users explicitly configure resource control for the unit.

But what I said still holds: critical production machines don't just
get rolling updates and "accidentally" switch to all this new
code. And those that do take the plunge have the cmdline options.

> > And it makes a lot more sense to account them in excess of the limit
> > than pretend they don't exist. We might not be able to completely
> > fulfill the containment part of the memory controller (although these
> > slab charges will still create significant pressure before that), but
> > at least we don't fail the accounting part on top of it.
> 
> Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any
> load could simply run away via kernel allocations. What is even worse we
> might even not trigger memcg OOM killer before we hit the global OOM. So
> the whole containment goes straight to hell.
>
> I can see four options here:
> 1) enable kmem by default with the current semantic which we know can
>    BUG_ON (at least btrfs is known to hit this) or lead to other issues.

Can you point me to that report?

That's not "semantics", that's a bug! Whether or not a feature is
enabled by default, it can not be allowed to crash the kernel.

Presenting this as a choice is a bit of a strawman argument.

> 2) enable kmem by default and change the semantic for cgroup2 to allow
>    runaway charges above the hard limit which would defeat the whole
>    purpose of the containment for cgroup2. This can be a temporary
>    workaround until we can afford kmem failures. This has a big risk
>    that we will end up with this permanently because there is a strong
>    pressure that GFP_KERNEL allocations should never fail. Yet this is
>    the most common type of request. Or do we change the consistency with
>    the global case at some point?

As per 1) we *have* to fail containment eventually if not doing so
means crashes and lockups. That's not a choice of semantics.

But that doesn't mean we have to give up *immediately* and allow
unrestrained "runaway charges"--again, more of a strawman than a
choice. We can still throttle the allocator and apply significant
pressure on the memory pool, culminating in OOM kills eventually.

Once we run out of available containment tools, however, we *have* to
follow the semantics of the page and slab allocator and succeed the
request. We can not just return -ENOMEM if that causes kernel bugs.

That's the only thing we can do right now.

In fact, it's likely going to be the best we will ever be able to do
when it comes to kernel memory accounting. Linus made it clear where
he stands on failing kernel allocations, so all we can do is continue
to improve our containment tools and then give up on containment when
they're exhausted and force the charge past the limit.
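The escalation ladder described here (charge within the limit, apply reclaim pressure, OOM kill, and only when all containment tools are exhausted force the charge) can be sketched as a toy retry loop. The reclaim and OOM functions below are stubs standing in for the kernel's try_to_free_mem_cgroup_pages() and the memcg OOM killer; the retry count is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

enum charge_result { CHARGED, CHARGED_OVER_LIMIT };

struct counter { long usage, limit; };

static bool try_within_limit(struct counter *c, long nr)
{
    if (c->usage + nr > c->limit)
        return false;
    c->usage += nr;
    return true;
}

/* Stubs: in this sketch nothing can be freed and no OOM victim
 * exists, so the loop must fall through to the forced charge. */
static long reclaim(struct counter *c)  { (void)c; return 0; }
static bool oom_kill(struct counter *c) { (void)c; return false; }

static enum charge_result charge(struct counter *c, long nr)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        if (try_within_limit(c, nr))
            return CHARGED;
        if (reclaim(c))
            continue;       /* freed something, retry the charge */
        if (oom_kill(c))
            continue;       /* a victim will release memory */
        break;              /* containment tools are exhausted */
    }
    c->usage += nr;         /* give up on containment, force it */
    return CHARGED_OVER_LIMIT;
}
```

The point of the structure is that failure (-ENOMEM) never escapes: every path ends in a successful charge, with the limit breach as the signal of lost containment rather than a crashed caller.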

> 3) keep only some (safe) cache types enabled by default with the current
>    failing semantic and require an explicit enabling for the complete
>    kmem accounting. [di]cache code paths should be quite robust to
>    handle allocation failures.

Vladimir, what would be your opinion on this?

> 4) disable kmem by default and change the config default later to signal
>    the accounting is safe as far as we are aware and let people enable
>    the functionality on that basis. We would keep the current failing
>    semantic.
> 
> To me 4) sounds like the safest option because it still keeps the
> functionality available to those who can benefit from it in v1 already
> while we are not exposing a potentially buggy behavior to the majority
> (many of them even unintentionally). Moreover we still allow changing
> the default later on an explicit basis.

I'm not interested in fragmenting the interface forever out of caution
because there might be a bug in the implementation right now. As I
said we have to fix any instability in the features we provide whether
they are turned on by default or not. I don't see how this is relevant
to the interface discussion.

Also, there is no way we can later fundamentally change the semantics
of memory.current, so it would have to remain configurable forever,
forcing people forever to select multiple options in order to piece
together a single logical kernel feature.

This is really not an option, either.

If there are show-stopping bugs in the implementation, I'd rather hold
off the release of the unified hierarchy than commit to a half-assed
interface right out of the gate. The point of v2 is sane interfaces.

So let's please focus on fixing any problems that slab accounting may
have, rather than designing complex config options and transition
procedures whose sole purpose is to defer dealing with our issues.

Please?
Johannes Weiner Nov. 5, 2015, 10:32 p.m. UTC | #21
On Thu, Nov 05, 2015 at 05:28:03PM +0100, Michal Hocko wrote:
> On Thu 05-11-15 11:16:09, David S. Miller wrote:
> > From: Michal Hocko <mhocko@kernel.org>
> > Date: Thu, 5 Nov 2015 15:40:02 +0100
> > 
> > > On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> > > [...]
> > >> Because it goes without saying that once the cgroupv2 interface is
> > >> released, and people use it in production, there is no way we can then
> > >> *add* dentry cache, inode cache, and others to memory.current. That
> > >> would be an unacceptable change in interface behavior.
> > > 
> > > They would still have to _enable_ the config option _explicitly_. make
> > > oldconfig wouldn't change it silently for them. I do not think
> > > it is an unacceptable change of behavior if the config is changed
> > > explicitly.
> > 
> > Every user is going to get this config option when they update their
> > distribution kernel or whatever.
> > 
> > Then they will all wonder why their networking performance went down.
> > 
> > This is why I do not want the networking accounting bits on by default
> > even if the kconfig option is enabled.  They must be off by default
> > and guarded by a static branch so the cost is exactly zero.
> 
> Yes, that part is clear and Johannes made it clear that the kmem tcp
> part is disabled by default. Or are you considering also all the slab
> usage by the networking code as well?

Michal, there shouldn't be any tracking or accounting going on per
default when you boot into a fresh system.

I removed all accounting and statistics on the system level in
cgroupv2, so distribution kernels can compile-time enable a single,
feature-complete CONFIG_MEMCG that provides a full memory controller
while at the same time puts no overhead on users that don't benefit
from mem control at all and just want to use the machine bare-metal.

This is completely doable. My new series does it for skmem, but I also
want to retrofit the code to eliminate that current overhead for page
cache, anonymous memory, slab memory and so forth.

This is the only sane way to make the memory controller powerful and
generally useful without having to make unreasonable compromises with
memory consumers. We shouldn't even be *having* the discussion about
whether we should sacrifice the quality of our interface in order to
compromise with a class of users that doesn't care about any of this
in the first place.

So let's eliminate the cost for non-users, but make the memory
controller feature-complete and useful--with reasonable cost,
implementation, and interface--for our actual userbase.

Paying the necessary cost for a functionality you actually want is not
the problem. Paying for something that doesn't benefit you is.
Johannes Weiner Nov. 5, 2015, 10:52 p.m. UTC | #22
On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > This would be true if they moved on to the new cgroup API intentionally.
> > The reality is more complicated though. AFAIK systemd is waiting for
> > cgroup2 already and privileged services enable all available resource
> > controllers by default as I've learned just recently.
> 
> Have you filed a report with them? I don't think they should turn them
> on unless users explicitly configure resource control for the unit.

Okay, verified with systemd people that they're not planning on
enabling resource control per default.

Inflammatory half-truths, man. This is not constructive.
Vladimir Davydov Nov. 6, 2015, 9:05 a.m. UTC | #23
On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
...
> > 3) keep only some (safe) cache types enabled by default with the current
> >    failing semantic and require an explicit enabling for the complete
> >    kmem accounting. [di]cache code paths should be quite robust to
> >    handle allocation failures.
> 
> Vladimir, what would be your opinion on this?

I'm all for this option. Actually, I've been thinking about this since I
introduced the __GFP_NOACCOUNT flag. Not because of the failing
semantics, since we can always let kmem allocations breach the limit.
This shouldn't be critical, because I don't think it's possible to issue
a series of kmem allocations w/o a single user page allocation, which
would reclaim/kill the excess.

The point is there are allocations that are shared system-wide and
therefore shouldn't go to any memcg. Most obvious examples are: mempool
users and radix_tree/idr preloads. Accounting them to memcg is likely to
result in noticeable memory overhead as memory cgroups are
created/destroyed, because they pin dead memory cgroups with all their
kmem caches, which aren't tiny.

Another funny example is objects destroyed lazily for performance
reasons, e.g. vmap_area. Such objects are usually very small, so
delaying destruction of a bunch of them will normally go unnoticed.
However, if kmemcg is used the effective memory consumption caused by
such objects can be multiplied by many times due to dangling kmem
caches.

We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the
problem is they are tricky to identify, because they are scattered all
over the kernel source tree. E.g. Dave Chinner mentioned that XFS
internals do a lot of allocations that are shared among all XFS
filesystems and therefore should not be accounted (BTW that's why
list_lru's used by XFS are not marked as memcg-aware). There must be
more out there. Besides, kernel developers don't usually even know about
kmemcg (they just write the code for their subsys, so why should they?)
so they won't care thinking about using __GFP_NOACCOUNT, and hence new
falsely-accounted allocations are likely to appear.

That said, by switching from black-list (__GFP_NOACCOUNT) to white-list
(__GFP_ACCOUNT) kmem accounting policy we would make the system more
predictable and robust IMO. OTOH what would we lose? Security? Well,
containers aren't secure IMHO. In fact, I doubt they will ever be (as
secure as VMs). Anyway, if a runaway allocation is reported, it should
be trivial to fix by adding __GFP_ACCOUNT where appropriate.

If there are no objections, I'll prepare a patch switching to the
white-list approach. Let's start from obvious things like fs_struct,
mm_struct, task_struct, signal_struct, dentry, inode, which can be
easily allocated from user space. This should cover 90% of all
allocations that should be accounted AFAICS. The rest will be added
later if necessary.

Thanks,
Vladimir
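Vladimir's proposed policy switch can be sketched as two predicates over gfp flags. The bit values below are invented for the sketch (the real __GFP_* definitions live in include/linux/gfp.h); __GFP_NOACCOUNT is the existing black-list flag, and __GFP_ACCOUNT is the opt-in flag he proposes.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag bits invented for illustration; not the real gfp.h values. */
#define GFP_KERNEL      0x01u
#define __GFP_ACCOUNT   0x02u  /* opt *in* to memcg accounting */
#define __GFP_NOACCOUNT 0x04u  /* opt *out* (the current black-list) */

/* Black-list policy: account everything unless explicitly excluded.
 * Shared allocations (mempools, radix_tree preloads, XFS internals)
 * must each remember to pass __GFP_NOACCOUNT. */
static bool accounted_blacklist(unsigned int gfp)
{
    return !(gfp & __GFP_NOACCOUNT);
}

/* White-list policy: account nothing unless explicitly requested, so
 * shared allocations are safe by default and only user-triggerable
 * objects (dentry, inode, task_struct, ...) opt in. */
static bool accounted_whitelist(unsigned int gfp)
{
    return gfp & __GFP_ACCOUNT;
}
```

The behavioral difference is exactly the failure mode discussed above: under the black-list a newly added shared allocation is silently accounted (and can pin dead memcgs), while under the white-list a missed annotation merely leaves an allocation unaccounted until someone adds the flag.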
Michal Hocko Nov. 6, 2015, 10:57 a.m. UTC | #24
On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > This would be true if they moved on to the new cgroup API intentionally.
> > > The reality is more complicated though. AFAIK systemd is waiting for
> > > cgroup2 already and privileged services enable all available resource
> > > controllers by default as I've learned just recently.
> > 
> > Have you filed a report with them? I don't think they should turn them
> > on unless users explicitly configure resource control for the unit.
> 
> Okay, verified with systemd people that they're not planning on
> enabling resource control per default.
> 
> Inflammatory half-truths, man. This is not constructive.

What about the Delegate=yes feature then? We have just been burnt by this
quite heavily. AFAIU nspawn@.service and nspawn@.service have this
enabled by default
http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html
Michal Hocko Nov. 6, 2015, 12:51 p.m. UTC | #25
On Thu 05-11-15 17:32:51, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 05:28:03PM +0100, Michal Hocko wrote:
[...]
> > Yes, that part is clear and Johannes made it clear that the kmem tcp
> > part is disabled by default. Or are you considering also all the slab
> > usage by the networking code as well?
> 
> Michal, there shouldn't be any tracking or accounting going on per
> default when you boot into a fresh system.
> 
> I removed all accounting and statistics on the system level in
> cgroupv2, so distribution kernels can compile-time enable a single,
> feature-complete CONFIG_MEMCG that provides a full memory controller
> while at the same time puts no overhead on users that don't benefit
> from mem control at all and just want to use the machine bare-metal.

Yes, that part is clear and I am not disputing it _at all_. It is just
that chances are high that the memory controller _will_ be enabled on
typical distribution systems. E.g. systemd _is_ enabling all resource
controllers by default for some services with the Delegate=yes option.

> This is completely doable. My new series does it for skmem, but I also
> want to retrofit the code to eliminate that current overhead for page
> cache, anonymous memory, slab memory and so forth.
> 
> This is the only sane way to make the memory controller powerful and
> generally useful without having to make unreasonable compromises with
> memory consumers. We shouldn't even be *having* the discussion about
> whether we should sacrifice the quality of our interface in order to
> compromise with a class of users that doesn't care about any of this
> in the first place.
> 
> So let's eliminate the cost for non-users, but make the memory
> controller feature-complete and useful--with reasonable cost,
> implementation, and interface--for our actual userbase.
> 
> Paying the necessary cost for a functionality you actually want is not
> the problem. Paying for something that doesn't benefit you is.

I completely agree that a reasonable cost is acceptable for those who
_want_ the functionality. But it hasn't been shown that people in the
wild have generally been missing kmem accounting. E.g. the kmem
controller is not even enabled in openSUSE or SLES kernels and I do not
remember there being a huge push to enable it.

I do understand that you want to have an out-of-the-box isolation
behavior which I agree is a nice-to-have feature. Especially with
a larger penetration of containerized workloads. But my point still
holds. This is not something everybody wants to have. So having a
config option and a boot time option to override it is the most
reasonable way to go. You can clearly see from the tcp kmem extension
that there is already demand for this, because those users really
_care_ about every single cpu cycle even though some part of the
userspace happens to have memcg enabled.

The question about the configuration default is a different question
and we can discuss that because this is not an easy one to decide right
now IMHO.
Michal Hocko Nov. 6, 2015, 1:21 p.m. UTC | #26
On Thu 05-11-15 15:55:22, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
[...]
> > This would be true if they moved on to the new cgroup API intentionally.
> The reality is more complicated though. AFAIK systemd is waiting for
> > cgroup2 already and privileged services enable all available resource
> > controllers by default as I've learned just recently.
> 
> Have you filed a report with them? I don't think they should turn them
> on unless users explicitly configure resource control for the unit.

We have just been bitten by this (aka Delegate=yes for some basic
services) and our systemd people are supposed to bring this up upstream.
I've mentioned that in another email where you accuse me of spreading
FUD.
 
> But what I said still holds: critical production machines don't just
> get rolling updates and "accidentally" switch to all this new
> code. And those that do take the plunge have the cmdline options.

That is exactly my point why I do not think re-evaluating the default
config option is a problem at all. The default wouldn't matter for
existing users. Those who care can have all the functionality they need
right away - be it kmem enabled or disabled.

> > > And it makes a lot more sense to account them in excess of the limit
> > > than pretend they don't exist. We might not be able to completely
> > > fulfill the containment part of the memory controller (although these
> > > slab charges will still create significant pressure before that), but
> > > at least we don't fail the accounting part on top of it.
> > 
> > Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any
> > load could simply run away via kernel allocations. What is even worse we
> > might even not trigger memcg OOM killer before we hit the global OOM. So
> > the whole containment goes straight to hell.
> >
> > I can see four options here:
> > 1) enable kmem by default with the current semantic which we know can
> >    BUG_ON (at least btrfs is known to hit this) or lead to other issues.
> 
> Can you point me to that report?

git grep "BUG_ON.*ENOMEM" -- fs/btrfs

just to give you a picture. Not all of them are kmalloc failures, and
others are not annotated by an ENOMEM comment. This came out as a
result of my last attempt to allow GFP_NOFS to fail
(http://lkml.kernel.org/r/1438768284-30927-1-git-send-email-mhocko%40kernel.org)
 
> That's not "semantics", that's a bug! Whether or not a feature is
> enabled by default, it can not be allowed to crash the kernel.

Yes, those are bugs and have to be fixed. Not an easy task, but
nothing that couldn't be solved; it just takes some time. They are not
very likely at the moment because they are reduced to corner cases,
but they become more visible with the current kmem accounting
semantics. So either we change the semantics or wait until this gets
fixed if the accounting should be on by default.

> Presenting this as a choice is a bit of a strawman argument.
> 
> > 2) enable kmem by default and change the semantic for cgroup2 to allow
> >    runaway charges above the hard limit which would defeat the whole
> >    purpose of the containment for cgroup2. This can be a temporary
> >    workaround until we can afford kmem failures. This has a big risk
> >    that we will end up with this permanently because there is a strong
> >    pressure that GFP_KERNEL allocations should never fail. Yet this is
> >    the most common type of request. Or do we change the consistency with
> >    the global case at some point?
> 
> As per 1) we *have* to fail containment eventually if not doing so
> means crashes and lockups. That's not a choice of semantics.
> 
> But that doesn't mean we have to give up *immediately* and allow
> unrestrained "runaway charges"--again, more of a strawman than a
> choice. We can still throttle the allocator and apply significant
> pressure on the memory pool, culminating in OOM kills eventually.
> 
> Once we run out of available containment tools, however, we *have* to
> follow the semantics of the page and slab allocator and succeed the
> request. We can not just return -ENOMEM if that causes kernel bugs.
> 
> That's the only thing we can do right now.
> 
> In fact, it's likely going to be the best we will ever be able to do
> when it comes to kernel memory accounting. Linus made it clear where
> he stands on failing kernel allocations, so all we can do is continue
> to improve our containment tools and then give up on containment when
> they're exhausted and force the charge past the limit.

OK, then we need all the additional measures to keep the excess over
the hard limit bounded.
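The "try a full batch, fall back to the exact request, then force the
charge past the limit" scheme discussed above (and implemented in
mem_cgroup_charge_skmem() in the patch below) can be modeled in plain
userspace C. This is only a sketch: struct counter,
counter_try_charge() and the CHARGE_BATCH value here are stand-ins for
the kernel's page_counter API and per-cpu stock, not the real thing.

```c
#include <stdbool.h>

#define CHARGE_BATCH 32U	/* stand-in for the kernel's batch size */

struct counter {
	unsigned long usage;
	unsigned long limit;
};

/* Succeed only if the charge fits under the limit. */
static bool counter_try_charge(struct counter *c, unsigned int n)
{
	if (c->usage + n > c->limit)
		return false;
	c->usage += n;
	return true;
}

/*
 * Model of the charge path: first try a full batch (in the kernel the
 * surplus would refill the per-cpu stock), then retry with just the
 * pages needed, and only then force the charge past the limit.
 * Returns false when the caller should treat the charge as failed,
 * even though the memory is accounted anyway.
 */
static bool charge_skmem(struct counter *memory, unsigned int nr_pages)
{
	unsigned int batch = nr_pages > CHARGE_BATCH ? nr_pages : CHARGE_BATCH;

	if (counter_try_charge(memory, batch))
		return true;
	if (batch > nr_pages && counter_try_charge(memory, nr_pages))
		return true;
	memory->usage += nr_pages;	/* forced: usage now exceeds limit */
	return false;
}
```

The forced path is what keeps the excess bounded to one request at a
time; reclaim (socket_work in the patch) is then expected to pull usage
back under the high boundary.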

> > 3) keep only some (safe) cache types enabled by default with the current
> >    failing semantic and require an explicit enabling for the complete
> >    kmem accounting. [di]cache code paths should be quite robust to
> >    handle allocation failures.
> 
> Vladimir, what would be your opinion on this?
> 
> > 4) disable kmem by default and change the config default later to signal
> >    the accounting is safe as far as we are aware and let people enable
> >    the functionality on those basis. We would keep the current failing
> >    semantic.
> > 
> > To me 4) sounds like the safest option because it still keeps the
> > functionality available to those who can benefit from it in v1 already
> > while we are not exposing a potentially buggy behavior to the majority
> > (many of them even unintentionally). Moreover we still allow to change
> > the default later on an explicit basis.
> 
> I'm not interested in fragmenting the interface forever out of caution
> because there might be a bug in the implementation right now. As I
> said we have to fix any instability in the features we provide whether
> they are turned on by default or not. I don't see how this is relevant
> to the interface discussion.
> 
> Also, there is no way we can later fundamentally change the semantics
> of memory.current, so it would have to remain configurable forever,
> forcing people forever to select multiple options in order to piece
> together a single logical kernel feature.
> 
> This is really not an option, either.

Why not? I can clearly see people who would really want to have this
disabled, and doing that via config is much easier than providing a
command line parameter. A config option doesn't add any more
maintenance burden than a boot time parameter would.
 
> If there are show-stopping bugs in the implementation, I'd rather hold
> off the release of the unified hierarchy than commit to a half-assed
> interface right out of the gate.

If you are willing to postpone releasing cgroup2 until this gets
resolved - one way or another - then I have no objections. My impression
was that Tejun wanted to release it sooner rather than later. As this
very discussion shows, we are not even sure what the kmem failure
behavior should be.

> The point of v2 is sane interfaces.

And the sane interface to me is to use a single set of knobs regardless
of memory type. We are currently only discussing what should be
accounted by default. My understanding of what David said is that tcp
kmem should be enabled only when explicitly opted in. Did I get this
wrong?

> So let's please focus on fixing any problems that slab accounting may
> have, rather than designing complex config options and transition
> procedures whose sole purpose is to defer dealing with our issues.
> 
> Please?
Michal Hocko Nov. 6, 2015, 1:29 p.m. UTC | #27
On Fri 06-11-15 12:05:55, Vladimir Davydov wrote:
[...]
> If there are no objections, I'll prepare a patch switching to the
> white-list approach. Let's start from obvious things like fs_struct,
> mm_struct, task_struct, signal_struct, dentry, inode, which can be
> easily allocated from user space.

pipe buffers, kernel stacks, and who knows what else.

> This should cover 90% of all
> allocations that should be accounted AFAICS. The rest will be added
> later if necessary.

The more I think about it, the more I am convinced that this is the
only sane way forward. The only concern I would have is how we deal
with the old interface in cgroup1. We do not want to break existing
deployments which might depend on the current behavior. I doubt any
do, but...
Johannes Weiner Nov. 6, 2015, 4:19 p.m. UTC | #28
On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote:
> On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > > This would be true if they moved on to the new cgroup API intentionally.
> > > > > The reality is more complicated though. AFAIK systemd is waiting for
> > > > cgroup2 already and privileged services enable all available resource
> > > > controllers by default as I've learned just recently.
> > > 
> > > Have you filed a report with them? I don't think they should turn them
> > > > on unless users explicitly configure resource control for the unit.
> > 
> > Okay, verified with systemd people that they're not planning on
> > enabling resource control per default.
> > 
> > Inflammatory half-truths, man. This is not constructive.
> 
> What about Delegate=yes feature then? We have just been burnt by this
> quite heavily. AFAIU nspawn@.service and nspawn@.service have this
> enabled by default
> http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html

That's when you launch a *container* and want it to be able to use
nested resource control.

We're talking about actual container users here. It's not turning on
resource control for all "privileged services", which is what we were
worried about here. Can you at least admit that when you yourself link
to the refuting evidence?

And if you've been "burnt quite heavily" by this, where is your bug
report to stop other users from getting "burnt quite heavily" as well?

All I read here is vague inflammatory language to spread FUD.

You might think sending these emails is helpful, but it really
isn't. Not only is it not contributing code, insights, or solutions,
you're now actively sabotaging someone else's effort to build something.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Weiner Nov. 6, 2015, 4:35 p.m. UTC | #29
On Fri, Nov 06, 2015 at 12:05:55PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> ...
> > > 3) keep only some (safe) cache types enabled by default with the current
> > >    failing semantic and require an explicit enabling for the complete
> > >    kmem accounting. [di]cache code paths should be quite robust to
> > >    handle allocation failures.
> > 
> > Vladimir, what would be your opinion on this?
> 
> I'm all for this option. Actually, I've been thinking about this since I
> introduced the __GFP_NOACCOUNT flag. Not because of the failing
> semantics, since we can always let kmem allocations breach the limit.
> This shouldn't be critical, because I don't think it's possible to issue
> a series of kmem allocations w/o a single user page allocation, which
> would reclaim/kill the excess.
> 
> The point is there are allocations that are shared system-wide and
> therefore shouldn't go to any memcg. Most obvious examples are: mempool
> users and radix_tree/idr preloads. Accounting them to memcg is likely to
> result in noticeable memory overhead as memory cgroups are
> created/destroyed, because they pin dead memory cgroups with all their
> kmem caches, which aren't tiny.
> 
> Another funny example is objects destroyed lazily for performance
> reasons, e.g. vmap_area. Such objects are usually very small, so
> delaying destruction of a bunch of them will normally go unnoticed.
> However, if kmemcg is used the effective memory consumption caused by
> such objects can be multiplied by many times due to dangling kmem
> caches.
> 
> We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the
> problem is they are tricky to identify, because they are scattered all
> over the kernel source tree. E.g. Dave Chinner mentioned that XFS
> internals do a lot of allocations that are shared among all XFS
> filesystems and therefore should not be accounted (BTW that's why
> list_lru's used by XFS are not marked as memcg-aware). There must be
> more out there. Besides, kernel developers don't usually even know about
> kmemcg (they just write the code for their subsys, so why should they?)
> so they won't care thinking about using __GFP_NOACCOUNT, and hence new
> falsely-accounted allocations are likely to appear.
> 
> That said, by switching from black-list (__GFP_NOACCOUNT) to white-list
> (__GFP_ACCOUNT) kmem accounting policy we would make the system more
> predictable and robust IMO. OTOH what would we lose? Security? Well,
> containers aren't secure IMHO. In fact, I doubt they will ever be (as
> secure as VMs). Anyway, if a runaway allocation is reported, it should
> be trivial to fix by adding __GFP_ACCOUNT where appropriate.

I wholeheartedly agree with all of this.

> If there are no objections, I'll prepare a patch switching to the
> white-list approach. Let's start from obvious things like fs_struct,
> mm_struct, task_struct, signal_struct, dentry, inode, which can be
> easily allocated from user space. This should cover 90% of all
> allocations that should be accounted AFAICS. The rest will be added
> later if necessary.

Awesome, I'm looking forward to that patch!
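The policy switch Vladimir proposes can be illustrated with a tiny
sketch. The flag values below are invented for illustration and the
helpers are hypothetical; only the direction of the default matters:
black-list accounts everything not explicitly excluded, white-list
accounts nothing that isn't explicitly included.

```c
#include <stdbool.h>

/* Invented flag values; the real kernel gfp bits differ. */
#define SKETCH_GFP_ACCOUNT	0x1u	/* proposed opt-in flag */
#define SKETCH_GFP_NOACCOUNT	0x2u	/* existing opt-out flag */

/* Current policy: account unless the caller opted out. */
static bool account_blacklist(unsigned int gfp)
{
	return !(gfp & SKETCH_GFP_NOACCOUNT);
}

/* Proposed policy: account only if the caller opted in. */
static bool account_whitelist(unsigned int gfp)
{
	return (gfp & SKETCH_GFP_ACCOUNT) != 0;
}
```

Under the white-list default, an unannotated allocation site is simply
not accounted, so a missed annotation costs containment accuracy rather
than pinning dead memory cgroups or misattributing shared objects.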
Michal Hocko Nov. 6, 2015, 4:46 p.m. UTC | #30
On Fri 06-11-15 11:19:53, Johannes Weiner wrote:
> On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote:
> > On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > > > This would be true if they moved on to the new cgroup API intentionally.
> > > > > The reality is more complicated though. AFAIK systemd is waiting for
> > > > > cgroup2 already and privileged services enable all available resource
> > > > > controllers by default as I've learned just recently.
> > > > 
> > > > Have you filed a report with them? I don't think they should turn them
> > > > on unless users explicitly configure resource control for the unit.
> > > 
> > > Okay, verified with systemd people that they're not planning on
> > > enabling resource control per default.
> > > 
> > > Inflammatory half-truths, man. This is not constructive.
> > 
> > What about Delegate=yes feature then? We have just been burnt by this
> > quite heavily. AFAIU nspawn@.service and nspawn@.service have this
> > enabled by default
> > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html
> 
> That's when you launch a *container* and want it to be able to use
> nested resource control.

Oops, copy&paste error here. The second one was user@.service. So it is
not only about containers, AFAIU, but all user-defined sessions.

> We're talking about actual container users here. It's not turning on
> resource control for all "privileged services", which is what we were
> worried about here. Can you at least admit that when you yourself link
> to the refuting evidence?

My bad, that was a misunderstanding of the changelog.

> And if you've been "burnt quite heavily" by this, where is your bug
> report to stop other users from getting "burnt quite heavily" as well?

The bug report is still internal because it is tracking an unreleased
product. We have ended up reverting the Delegate feature. Our systemd
developers are supposed to bring this up with upstream.

The basic problem was that the Delegate feature has been backported to
our systemd package without further consideration and that has
invalidated a lot of performance testing because some resource
controllers have measurable effects on those benchmarks.

> All I read here is vague inflammatory language to spread FUD.

I was merely pointing out that the memory controller might be enabled
without the _user_ even noticing, because the controller wasn't enabled
explicitly. I haven't blamed anybody for that.

> You might think sending these emails is helpful, but it really
> isn't. Not only is it not contributing code, insights, or solutions,
> you're now actively sabotaging someone else's effort to build something.

Come on! Are you even serious?
Johannes Weiner Nov. 6, 2015, 5:45 p.m. UTC | #31
On Fri, Nov 06, 2015 at 05:46:57PM +0100, Michal Hocko wrote:
> The basic problem was that the Delegate feature has been backported to
> our systemd package without further consideration and that has
> invalidated a lot of performance testing because some resource
> controllers have measurable effects on those benchmarks.

You're talking about a userspace bug. No amount of fragmenting and
layering and opt-in in the kernel's runtime configuration space is
going to help you if you screw up and enable it all by accident.

> > All I read here is vague inflammatory language to spread FUD.
> 
> I was merely pointing out that the memory controller might be enabled
> without the _user_ even noticing, because the controller wasn't enabled
> explicitly. I haven't blamed anybody for that.

Why does that have anything to do with how we design our interface?

We can't do more than present a sane interface in good faith and lobby
userspace projects if we think they misuse it.
David Miller Nov. 7, 2015, 3:45 a.m. UTC | #32
From: Michal Hocko <mhocko@kernel.org>
Date: Fri, 6 Nov 2015 17:46:57 +0100

> On Fri 06-11-15 11:19:53, Johannes Weiner wrote:
>> You might think sending these emails is helpful, but it really
>> isn't. Not only is it not contributing code, insights, or solutions,
>> you're now actively sabotaging someone else's effort to build something.
> 
> Come on! Are you even serious?

He is, and I agree 100% with him FWIW.
Mel Gorman Nov. 12, 2015, 6:36 p.m. UTC | #33
On Fri, Nov 06, 2015 at 11:19:53AM -0500, Johannes Weiner wrote:
> On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote:
> > On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > > > This would be true if they moved on to the new cgroup API intentionally.
> > > > > The reality is more complicated though. AFAIK systemd is waiting for
> > > > > cgroup2 already and privileged services enable all available resource
> > > > > controllers by default as I've learned just recently.
> > > > 
> > > > Have you filed a report with them? I don't think they should turn them
> > > > on unless users explicitly configure resource control for the unit.
> > > 
> > > Okay, verified with systemd people that they're not planning on
> > > enabling resource control per default.
> > > 
> > > Inflammatory half-truths, man. This is not constructive.
> > 
> > What about Delegate=yes feature then? We have just been burnt by this
> > quite heavily. AFAIU nspawn@.service and nspawn@.service have this
> > enabled by default
> > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html
> 
> That's when you launch a *container* and want it to be able to use
> nested resource control.
> 
> We're talking about actual container users here. It's not turning on
> resource control for all "privileged services", which is what we were
> worried about here. Can you at least admit that when you yourself link
> to the refuting evidence?
> 
> And if you've been "burnt quite heavily" by this, where is your bug
> report to stop other users from getting "burnt quite heavily" as well?
> 

I didn't read this thread in detail, but the lack of public information
on problems with cgroup controllers is partially my fault, so I'd like
to correct that.

https://bugzilla.suse.com/show_bug.cgi?id=954765

This bug documents some of the impact that was incurred due to ssh
sessions being resource controlled by default. It talks primarily about
pipetest being impacted by cpu,cpuacct. It was also found in the recent
past that dbench4 was impacted because the blkio controller was
enabled. That bug is not public, but basically dbench4 regressed 80%
because the journal thread was in a different cgroup than dbench4.
dbench4 would stall for 8ms if more IO was issued before the journal
thread could issue any IO. The openSUSE bug 954765 is not affected by
blkio because it's disabled by a distribution-specific patch. Mike
Galbraith adds some additional information on why activating the cpu
controller can have an impact on semantics even if the overhead were
zero.

It may be the case that it's an oversight by the systemd developers and
the intent was only to affect containers. My experience was that
everything was affected. It may also be the case that this is an
openSUSE-specific problem due to how the maintainers packaged systemd.
I don't actually know, and hopefully the bug will help determine
whether upstream is really affected or not.

There is also a link to this bug on the upstream project so there is some
chance they are aware https://github.com/systemd/systemd/issues/1715

Bottom line, there is legitimate confusion over whether cgroup controllers
are going to be enabled by default or not in the future. If they are
enabled by default, there is a non-zero cost to that and a change in
semantics that people may or may not be surprised by.
Johannes Weiner Nov. 12, 2015, 7:12 p.m. UTC | #34
On Thu, Nov 12, 2015 at 06:36:20PM +0000, Mel Gorman wrote:
> Bottom line, there is legitimate confusion over whether cgroup controllers
> are going to be enabled by default or not in the future. If they are
> enabled by default, there is a non-zero cost to that and a change in
> semantics that people may or may not be surprised by.

Thanks for elaborating, Mel.

My understanding is that this is a plain bug. I don't think anybody
wants to put costs without benefits on their users.

But I'll keep an eye on these reports, and I'll work with the systemd
people should issues with the kernel interface materialize that would
force them to enable resource control prematurely.
Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5b72f83..6f1e0f8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -244,6 +244,10 @@  struct mem_cgroup {
 	struct wb_domain cgwb_domain;
 #endif
 
+#ifdef CONFIG_INET
+	struct work_struct socket_work;
+#endif
+
 	/* List of events which userspace want to receive */
 	struct list_head event_list;
 	spinlock_t event_list_lock;
@@ -676,11 +680,15 @@  static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 struct sock;
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
-extern struct static_key_false mem_cgroup_sockets;
+#ifdef CONFIG_INET
+extern struct static_key_true mem_cgroup_sockets;
 static inline bool mem_cgroup_do_sockets(void)
 {
-	return static_branch_unlikely(&mem_cgroup_sockets);
+	if (mem_cgroup_disabled())
+		return false;
+	if (!static_branch_likely(&mem_cgroup_sockets))
+		return false;
+	return true;
 }
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
@@ -706,7 +714,7 @@  static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg,
 					     unsigned int nr_pages)
 {
 }
-#endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
+#endif /* CONFIG_INET */
 
 #ifdef CONFIG_MEMCG_KMEM
 extern struct static_key memcg_kmem_enabled_key;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3789050..cb1d6aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1916,6 +1916,18 @@  static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
+static void reclaim_high(struct mem_cgroup *memcg,
+			 unsigned int nr_pages,
+			 gfp_t gfp_mask)
+{
+	do {
+		if (page_counter_read(&memcg->memory) <= memcg->high)
+			continue;
+		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
+		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+	} while ((memcg = parent_mem_cgroup(memcg)));
+}
+
 /*
  * Scheduled by try_charge() to be executed from the userland return path
  * and reclaims memory over the high limit.
@@ -1923,20 +1935,13 @@  static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 void mem_cgroup_handle_over_high(void)
 {
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
-	struct mem_cgroup *memcg, *pos;
+	struct mem_cgroup *memcg;
 
 	if (likely(!nr_pages))
 		return;
 
-	pos = memcg = get_mem_cgroup_from_mm(current->mm);
-
-	do {
-		if (page_counter_read(&pos->memory) <= pos->high)
-			continue;
-		mem_cgroup_events(pos, MEMCG_HIGH, 1);
-		try_to_free_mem_cgroup_pages(pos, nr_pages, GFP_KERNEL, true);
-	} while ((pos = parent_mem_cgroup(pos)));
-
+	memcg = get_mem_cgroup_from_mm(current->mm);
+	reclaim_high(memcg, nr_pages, GFP_KERNEL);
 	css_put(&memcg->css);
 	current->memcg_nr_pages_over_high = 0;
 }
@@ -4129,6 +4134,8 @@  struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 }
 EXPORT_SYMBOL(parent_mem_cgroup);
 
+static void socket_work_func(struct work_struct *work);
+
 static struct cgroup_subsys_state * __ref
 mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -4169,6 +4176,9 @@  mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
 #endif
+#ifdef CONFIG_INET
+	INIT_WORK(&memcg->socket_work, socket_work_func);
+#endif
 	return &memcg->css;
 
 free_out:
@@ -4266,6 +4276,8 @@  static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
+	cancel_work_sync(&memcg->socket_work);
+
 	memcg_destroy_kmem(memcg);
 	__mem_cgroup_free(memcg);
 }
@@ -4948,10 +4960,15 @@  static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
 	 * guarantees that @root doesn't have any children, so turning it
 	 * on for the root memcg is enough.
 	 */
-	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
 		root_mem_cgroup->use_hierarchy = true;
-	else
+#ifdef CONFIG_INET
+		/* unified hierarchy always counts skmem */
+		static_branch_enable(&mem_cgroup_sockets);
+#endif
+	} else {
 		root_mem_cgroup->use_hierarchy = false;
+	}
 }
 
 static u64 memory_current_read(struct cgroup_subsys_state *css,
@@ -5453,10 +5470,9 @@  void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage)
 	commit_charge(newpage, memcg, true);
 }
 
-/* Writing them here to avoid exposing memcg's inner layout */
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+#ifdef CONFIG_INET
 
-DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);
+DEFINE_STATIC_KEY_TRUE(mem_cgroup_sockets);
 
 void sock_update_memcg(struct sock *sk)
 {
@@ -5490,6 +5506,14 @@  void sock_release_memcg(struct sock *sk)
 		css_put(&sk->sk_memcg->css);
 }
 
+static void socket_work_func(struct work_struct *work)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = container_of(work, struct mem_cgroup, socket_work);
+	reclaim_high(memcg, CHARGE_BATCH, GFP_KERNEL);
+}
+
 /**
  * mem_cgroup_charge_skmem - charge socket memory
  * @memcg: memcg to charge
@@ -5500,13 +5524,38 @@  void sock_release_memcg(struct sock *sk)
  */
 bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
+	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	struct page_counter *counter;
+	bool force = false;
 
-	if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
+		if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
+			return true;
+		page_counter_charge(&memcg->skmem, nr_pages);
+		return false;
+	}
+
+	if (consume_stock(memcg, nr_pages))
 		return true;
+retry:
+	if (page_counter_try_charge(&memcg->memory, batch, &counter))
+		goto done;
 
-	page_counter_charge(&memcg->skmem, nr_pages);
-	return false;
+	if (batch > nr_pages) {
+		batch = nr_pages;
+		goto retry;
+	}
+
+	force = true;
+	page_counter_charge(&memcg->memory, batch);
+done:
+	css_get_many(&memcg->css, batch);
+	if (batch > nr_pages)
+		refill_stock(memcg, batch - nr_pages);
+
+	schedule_work(&memcg->socket_work);
+
+	return !force;
 }
 
 /**
@@ -5516,10 +5565,16 @@  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
  */
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	page_counter_uncharge(&memcg->skmem, nr_pages);
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
+		page_counter_uncharge(&memcg->skmem, nr_pages);
+		return;
+	}
+
+	page_counter_uncharge(&memcg->memory, nr_pages);
+	css_put_many(&memcg->css, nr_pages);
 }
 
-#endif
+#endif /* CONFIG_INET */
 
 /*
  * subsys_initcall() for memory controller.